so that a write operation from a concurrent transaction will
not interfere improperly with the read operation. However,
setting a lock or leaving a timestamp with a data element is
an expensive operation. The purpose of the current research
is to seek ways to reduce the overhead of synchronizing
certain types of read accesses while still achieving
the goal of serializability.

To  this  end,  a  hierarchical  structure  is  proposed  here  as 
the  means  for  analyzing  opportunities  of  reducing  concurrency 
control  overhead  in  a  database  application.  A  theorem  is 
proven  which  indicates  that  in  a  hierarchically  decomposed 
database one can construct yet a third type of partial order of
all transactions which is different from the simple commit
order and the simple initiation order that have been used in
the existing algorithms. This new type of partial order,
called topologically follows, enables a new algorithm
for concurrency control to be devised which is believed to have
the potential of reducing the overhead of read access
synchronization in a database system.

The  report  develops  the  theory  that  supports  the  correctness 
of  such  analysis  and  algorithms.  An  implementation  scheme 
for  the  proposed  algorithm,  the  Hierarchical  Timestamping 
Algorithm, is described. The complexity of the hierarchical
database decomposition (HDD) methodology is analyzed and a heuristic
procedure is proposed. The HDD approach is applied to three
different application areas in which its effect on relieving
database contention and on structuring databases and
transactions so as to reduce concurrency control overhead is
demonstrated.

ABSTRACT


Increasing demand for information system capacity has prompted researchers
to find ways to improve the computer systems used in information processing.
Database management systems (DBMS's) represent one such effort to provide
better information services at lower costs. In order to minimize response time
and maximize throughput, it is desirable that a DBMS support multiple users
at the same time, allowing multiple transactions to run in parallel. However,
for the purpose of maintaining database consistency and integrity, such
parallelism must be properly controlled. For example, suppose two
transactions that transfer money into the same bank account are run in parallel.
Without proper coordination between the two transactions, it is possible that
both of them would read the same old balance, modify it independently and
write these independently modified balances back to the database. If this
happens, the final balance will reflect the result of only one, but not both,
of the transactions, causing one transaction to be lost. To prevent such
violations of database consistency and integrity from taking place, the
concurrency control facility is an indispensable component of the database
management system.

A generally accepted criterion for correctness of a concurrency control
algorithm is the criterion of serializability of transactions. The classical
approaches to enforcing serializability are the two-phase locking technique
and the timestamp ordering technique. The first technique ensures
serializability by imposing a partial order on all transactions based on their
commit order, while the second bases it on their initiation order. Either approach
requires that a read operation from a transaction be registered (in the form
of either a read timestamp or a read lock), so that a write operation from a
concurrent transaction will not interfere improperly with the read operation.
However, setting a lock or leaving a timestamp with a data element is an
expensive operation. The purpose of the current research is to seek ways to
reduce the overhead of synchronizing certain types of read accesses while
still achieving the goal of serializability.

To this end, a hierarchical structure is proposed here as the means for
analyzing opportunities for reducing concurrency control overhead in a database
application. A theorem is proven which indicates that in a hierarchically
decomposed database one can construct yet a third type of partial order of all
transactions which is different from the simple commit order and the simple
initiation order that have been used in the existing algorithms. This new
type of partial order, called topologically follows, enables a new algorithm
for concurrency control to be devised which is believed to have the ability of
effectively reducing the overhead of read access synchronization in a database
system that endorses a hierarchical decomposition of the database.

The thesis develops the theory that supports the correctness of such
analysis and algorithms. An implementation scheme for the proposed algorithm, the
Hierarchical Timestamping Algorithm, is described. The complexity of the
hierarchical database decomposition (HDD) methodology is analyzed and a heuristic
procedure is proposed. The HDD approach is applied to three different
application areas in which its effect on relieving database contention and on
structuring databases and transactions so as to reduce concurrency control
overhead is demonstrated.

1.0 INTRODUCTION AND THE SCOPE OF THE THESIS


1.1  INTRODUCTION  TO  THE  CONCURRENCY  CONTROL  PROBLEM 

Increasing demand for information system capacity has prompted
researchers to find ways to improve the computer systems used in
information processing. Database management systems (DBMS's) represent one
such effort to provide better information services at lower costs. In
order to minimize response time and maximize throughput, it is desirable
that a DBMS support multiple users at the same time, allowing multiple
transactions to run in parallel. However, for the purpose of
maintaining database consistency and integrity, such parallelism must be
properly controlled. Therefore the concurrency control facility is an
indispensable component of the database management system.

The role of a concurrency control mechanism is to preserve the
atomicity of a user transaction, i.e., it will prevent the processing of
a transaction from being erroneously interleaved with other concurrent
transactions, so that each transaction sees a consistent database state
and, if necessary, can be recovered or backed out as a single unit.

A typical example of database inconsistency induced by improper
interleaving of steps of concurrent transactions is shown in Figure 1.


Figure 1. An example of database inconsistency induced by concurrent
processing.

Currently there is $100 in Smith's account.
t1: Deposit $50 into Smith's account.
t2: Withdraw $50 from Smith's account.


As shown in the figure, two transactions simultaneously accessing the
same piece of data may result in a lost update, leaving the database in an
incorrect state. This is due to the fact that the database access steps
from these two transactions are not properly interleaved. By requiring
that the database management system exercise control over such
interleaving of concurrent transactions, the undesirable effects can be
eliminated.
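
The lost update can be made concrete with a small simulation. The sketch below is illustrative only: the exact step order of the figure is not reproduced in the text, so the interleaving used here (both transactions read before either writes) is an assumption, as are the names run, serial and interleaved.

```python
def run(schedule, balance=100):
    """Execute read/write steps of t1 (deposit $50) and t2 (withdraw $50)."""
    db = {"smith": balance}
    local = {}                          # each transaction's private copy
    deltas = {"t1": +50, "t2": -50}
    for txn, op in schedule:
        if op == "read":
            local[txn] = db["smith"]    # remember the balance that was seen
        else:
            # "write": store the privately modified balance
            db["smith"] = local[txn] + deltas[txn]
    return db["smith"]


serial = [("t1", "read"), ("t1", "write"), ("t2", "read"), ("t2", "write")]
interleaved = [("t1", "read"), ("t2", "read"), ("t1", "write"), ("t2", "write")]

serial_balance = run(serial)            # deposit, then withdrawal
interleaved_balance = run(interleaved)  # t1's deposit is overwritten by t2
```

Under these assumptions the serial order leaves $100 in the account, while the interleaved order leaves $50: the deposit made by t1 is lost.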

1.2  THE  SCOPE  OF  CURRENT  RESEARCH 

1.2.1  INTRODUCTION 

A generally accepted criterion for correctness of a concurrency
control algorithm is the criterion of serializability of transactions,
which means that interleaving is harmless so long as one can show that
the net effect of such interleaving is equivalent to some serialized
processing (i.e., one after another) of all the transactions involved.
In the above example, it is apparent that the steps of the two
transactions are scheduled in such a way that there exists no serialized
schedule (i.e., either t1 after t2 or t2 after t1) that would have
generated the same net effect. Therefore the schedule of these steps is
not serializable, and therefore incorrect.


The classical approach to enforcing serializability is the two-phase
locking technique. This technique locks up data elements being accessed
by one transaction and blocks other transactions from operating on these
data elements until the first transaction is finished. Another approach
to dealing with this problem is the timestamp ordering
technique, which stamps the data elements with the timestamps of the
transactions that have operated on them so as to prevent violation of a
pre-determined order from taking place.
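
To make the timestamp ordering idea concrete, the following is a minimal sketch of the textbook basic timestamp ordering rule. It is not the algorithm developed in this thesis. The class and method names are hypothetical, timestamps are assumed to be positive integers, and 0 stands for "never accessed".

```python
class Rejected(Exception):
    """Raised when an access would violate timestamp order (caller aborts)."""


class BasicTimestampOrdering:
    def __init__(self):
        self.read_ts = {}    # data element -> largest timestamp that read it
        self.write_ts = {}   # data element -> largest timestamp that wrote it

    def read(self, ts, item):
        if ts < self.write_ts.get(item, 0):
            raise Rejected(f"read of {item} by ts={ts} arrives too late")
        # Registering the read is itself an update of the bookkeeping data.
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)

    def write(self, ts, item):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            raise Rejected(f"write of {item} by ts={ts} arrives too late")
        self.write_ts[item] = ts


scheduler = BasicTimestampOrdering()
scheduler.write(1, "x")
scheduler.read(3, "x")        # leaves read timestamp 3 on x
try:
    scheduler.write(2, "x")   # an older writer arrives after a younger reader
    outcome = "accepted"
except Rejected:
    outcome = "rejected"
```

Note that every read modifies read_ts; this registration step is the kind of bookkeeping whose cost motivates the discussion that follows.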

Either approach requires that a read operation from a transaction be
registered (in the form of either a read timestamp or a read lock), so
that a write operation from a concurrent transaction will not interfere
improperly with the read operation. Setting a lock or leaving a
timestamp with a data element is an expensive operation. It not only
incurs a write operation in the database (in the form of setting the
read lock or writing the timestamp), but also potentially causes delays
for concurrent transactions that might be avoided.

The purpose of the current research is to seek ways to reduce the
overhead of synchronizing certain types of read accesses while still
achieving the goal of serializability. A more formal and
comprehensive overview of the literature in concurrency control,
including efforts up to now that have aimed at similar goals, will be
given in the next chapter. Here we intuitively motivate this research
further through a brief discussion of some empirical observations of the
impact of concurrency control on performance of database systems and an
example that illustrates the scope of this thesis research.

1.2.2 CONCURRENCY CONTROL AS A POTENTIAL BOTTLENECK OF DBMS'S

It was previously pointed out that setting locks and leaving
timestamps  are  potentially  expensive  operations.  We  now  present  a  more 
specific  discussion  on  how  the  concurrency  control  activities  could  be  a 
threat  to  performance.  We  will  approach  this  issue  from  two  different 
perspectives.  The  first  is  the  impact  of  the  processing  overhead  of  the 
concurrency  control  activities  on  the  effective  level  of 
multiprocessing.  The  second  is  their  impact  on  the  level  of  transaction 
concurrency  achievable  and  on  overall  performance. 

1.2.2.1 PROCESSING OVERHEAD OF CONCURRENCY CONTROL

In <Gray78> it was reported that, in System R, the lock manager
comprises 3 percent of the code and consumes about 10 percent of the
instruction execution time. The latter obviously would vary a great
deal depending on applications. A lock-unlock pair typically costs 350
instructions. No comparable statistics on timestamping algorithms have
been reported, but it is believed that the overhead of timestamping
would not be significantly different from that of locking, since both
activities basically involve table manipulation and updates.


Taking these numbers as reference, this type of processing overhead
of concurrency control in general would not amount to an unacceptable load
on the computer system. However, this overhead does tend to grow as the
required throughput of transactions and transaction sizes increase,
since both of these increases would cause lock tables, etc., to grow in
size and complexity, making each visit to these tables more
time-consuming. In addition, as pointed out in <Friedman80>, practical
experiences with DBMS's such as IBM's IMS show that performance of the
system tables (e.g., lock tables) associated with the concurrency
control facility (or the 'program isolation facility' in IMS's terms) has
an important bearing on the overall performance of the DBMS and should be
taken into consideration when designing transactions, especially
transactions that involve many granule accesses and updates. For example,
the existence of long transactions, if not carefully handled, often severely
degrades performance of unrelated short transactions through
interference of the concurrency control facility itself.

However, a potentially more serious problem of concurrency control
overhead is the fact that concurrency control has to be performed in a
critical section which no other process can interrupt. This means
that if a transaction is currently manipulating a lock table, no other
transactions will be allowed to perform locking and unlocking involving
the same lock table until the former is finished. It is this contention
for the right to enter a critical section for concurrency control
purposes that may prove to be a serious limitation in building a
multi-processor system with a high degree of multiprocessing. For
example, if concurrency control consumes about 10 percent of the instruction
execution time in a particular critical section per transaction, then
the maximum number of effective processors a multiprocessor transaction
processing system may have is 10. While this is not a problem when the
degree of multiprocessing required is low, it is certainly an issue in
designing large multiprocessor database computers such as the one to be
described in Section 9.3. Reducing visit frequencies to these critical
sections, which means reducing frequencies of setting locks or
timestamping, would have a significant impact on designing
multiprocessor database computers.
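
The bound of 10 processors follows from a simple capacity argument, sketched below. The function names are hypothetical, and the model is an assumption: one shared critical section, a fixed fraction of each transaction's work spent inside it, and no other source of contention.

```python
def max_effective_processors(serialized_fraction):
    """Upper bound on processors kept busy when this fraction of each
    transaction's execution must run inside one shared critical section."""
    if not 0 < serialized_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1 / serialized_fraction


def throughput_bound(processors, serialized_fraction, work_per_txn=1.0):
    """Transactions per unit time: limited both by total processing capacity
    and by the critical section, which serves one transaction at a time."""
    capacity_limit = processors / work_per_txn
    critical_section_limit = 1 / (serialized_fraction * work_per_txn)
    return min(capacity_limit, critical_section_limit)


bound = max_effective_processors(0.10)      # the 10 percent case in the text
with_5 = throughput_bound(5, 0.10)          # still limited by capacity
with_20 = throughput_bound(20, 0.10)        # limited by the critical section
```

With a 10 percent serialized fraction, adding processors beyond 10 does not raise the bound, which is why reducing visits to the critical section matters more as the degree of multiprocessing grows.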

1.2.2.2 IMPACT OF LEVEL OF CONCURRENCY ON DBMS PERFORMANCE: TWO
EMPIRICAL CASES

In addition to the processing overhead and critical section
limitations, a concurrency control method that results in a high volume of
transaction blocking, and therefore limits the level of concurrency
achievable in the system, will inevitably cause computing resources to be
under-utilized while throughput and response times of the system suffer.
In this section, we provide a synopsis of two cases where the level of
concurrency in the systems has proven to be the limiting factor of
performance of the systems. We will first clarify the notion of level of
concurrency through a simple example.

Consider the following interleaved schedule of read and write steps
from three transactions t0, t1, and t2:

W0(b) R1(a) R2(b) W1(b) W2(c)

where R1(a) refers to a request from transaction t1 to read data element
a, W1(b) to a request from t1 to write data element b, and so on. Suppose
the basic timestamping algorithm is used and that the timestamp of t1 is
smaller than that of t2. Then t1 will be denied the right to write data
element b (since at that time b is stamped with a t2 read timestamp which
is greater than t1's timestamp), and forced to abort. On the other hand, if
this interleaved sequence of access requests is received by a two-phase
locking facility, t1's request to put a write lock on b would have to be
blocked until t2 finishes and releases its read lock on b. However, an
examination of the schedule reveals that the schedule is
serializable and is equivalent to a serialized schedule in which t1 is
executed after t2. Since the timestamp algorithm is designed to
synchronize concurrent transactions by timestamp order, any serializable
schedule that violates the timestamp order is ruled out. Similar
arguments prevail for the locking protocol.
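
The claim that the schedule is serializable can be checked mechanically with the standard precedence-graph (conflict serializability) test. This test is a sufficient condition and is used here only as an illustration; it is not part of the method proposed in this thesis. The function names are hypothetical, and the sketch assumes Python 3.9 or later for the graphlib module.

```python
from graphlib import CycleError, TopologicalSorter


def precedence_edges(schedule):
    """Edges (earlier, later) for conflicting steps: same data element,
    different transactions, and at least one of the two steps is a write."""
    edges = set()
    for i, (op_a, txn_a, item_a) in enumerate(schedule):
        for op_b, txn_b, item_b in schedule[i + 1:]:
            if item_a == item_b and txn_a != txn_b and "W" in (op_a, op_b):
                edges.add((txn_a, txn_b))
    return edges


def equivalent_serial_order(schedule):
    """Return one equivalent serial order, or None if the graph has a cycle."""
    sorter = TopologicalSorter()
    for _, txn, _ in schedule:
        sorter.add(txn)
    for earlier, later in precedence_edges(schedule):
        sorter.add(later, earlier)      # 'later' must follow 'earlier'
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


# W0(b) R1(a) R2(b) W1(b) W2(c)
schedule = [("W", "t0", "b"), ("R", "t1", "a"), ("R", "t2", "b"),
            ("W", "t1", "b"), ("W", "t2", "c")]
order = equivalent_serial_order(schedule)
```

For this schedule the only precedence edges are t0 before t2, t0 before t1, and t2 before t1, so the equivalent serial order places t1 after t2, as stated above.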

Blocking or aborting transactions causes the level of concurrency
attainable in the system to suffer, which in turn degrades system
performance. As shown in the above example, not all interferences from a
concurrency control algorithm through blocking or aborting are
necessary. Therefore, in <Papadimitriou82>, optimality of a concurrency
control algorithm is defined as the degree to which the algorithm allows
'serializable input schedules' to proceed without being interrupted.
Due to issues of implementability, no known concurrency control method
allows all serializable schedules to proceed without interference.
However, optimization of currently available methods through transaction
analysis, such as in the case of the hierarchical database decomposition
approach reported in this thesis, can increase the level of concurrency
of a DBMS.

Now we use two cases to illustrate the effect of the level of
concurrency on system performance.

1.2.2.3 BENCHMARKING A DATABASE APPLICATION

This case was reported by the BDM Corporation <BDM>. In 1982, the
BDM Corporation, a general consulting firm based in Washington D.C., was
contracted to develop a document control system for the Department of
Defense, called The Technical Information Management Library and
Document Control System. The system was developed on the ORACLE Database
Management System, version 2, a commercially available DBMS released in
1981. The design of the ORACLE DBMS was strongly influenced by System
R, an experimental relational database management system developed at
IBM Research Labs.

Two observations were made by the BDM development team after
performing benchmark testing for the document control system:

(1) In the first set of benchmark tests, two sets of transactions
were run against the system. The first set consisted of only
read-only transactions, and was run without any interference
from the concurrency control facility. The second set of
transactions was similar in nature to the first except that they were
update transactions and therefore required 3 times as many I/O
operations as the set of read-only transactions. The result of the
benchmark was that running the first set of transactions attained
a throughput 10 times as much as that of the second set. It was
determined that the portion of the discrepancy in performance
not explainable by the differences in I/O requirements is to be
attributed to the concurrency control activity, which has
apparently limited the level of concurrency attainable by the
computer system and severely degraded system throughput.
Assuming that each transaction in the second set requires 3 times as
much CPU and I/O as that of the first set and therefore can be
considered as equivalent to 3 transactions in the first set, one
draws the conclusion that for this particular application and
set of transactions, turning off interference from the
concurrency control facility may result in a 330 percent increase in
throughput. Conversely, concurrency control is responsible for a
75 percent degradation in throughput.

(2) The above benchmark tests do not involve conversational
transactions. A conversational transaction is a transaction that
involves operator think time. For example, if a user inputs
'Begin Transaction' and requests certain processing be done,
then waits for a certain amount of time ('think time') to elapse
before inputting the rest of the processing request for that
transaction and issuing the 'End Transaction' instruction,
this user is executing a conversational transaction. The second
set of benchmark tests performed by BDM is a comparison between
performance of conversational update transactions and
non-conversational update transactions. At the development
site, a benchmark of a certain type of update transaction was
performed and was reported to have attained a throughput rate of
10 transactions per minute. This test case did not take
operator think time into consideration. However, when the system was
installed at the user site and was tested with the same type of
update transactions with operator think time involved in the
execution of these transactions, the throughput suffered
dramatically and dropped to 1 transaction per minute, with a response
time of 3 minutes per transaction. A diagnosis of the cause of
this degradation indicated that interference from the
concurrency control facility made it impractical for the system to run
more than 3 transactions at a time, severely limiting the use of
the potential capacity of the system.
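
The figures in the two observations can be related by simple arithmetic. The sketch below is a rough check under stated assumptions, not a reconstruction of BDM's analysis: the function names are hypothetical, the first calculation discounts the raw throughput ratio by the stated factor of 3 in work per transaction, and the second applies Little's law (throughput equals the number of transactions in the system divided by the time each spends there).

```python
def normalized_throughput_ratio(raw_ratio, work_factor):
    """Throughput advantage left after discounting extra work per transaction."""
    return raw_ratio / work_factor


def littles_law_throughput(concurrent_txns, response_time):
    """Little's law: throughput = number in system / time in system."""
    return concurrent_txns / response_time


# Observation (1): 10 times the throughput, 3 times the work per transaction.
ratio = normalized_throughput_ratio(10, 3)       # about 3.3
reduction = 1 - 1 / ratio                        # about 0.70

# Observation (2): at most 3 concurrent transactions, 3 minutes each.
conversational = littles_law_throughput(3, 3.0)  # transactions per minute
```

The ratio of about 3.3 corresponds to the 330 percent figure quoted in (1), and the implied reduction of about 70 percent is of the same order as the quoted 75 percent. For (2), three concurrent transactions with a 3-minute response time give exactly the reported 1 transaction per minute.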

1.2.2.4 PREDICTING PERFORMANCE OF A HIGH-THROUGHPUT DBMS APPLICATION




This case was reported by BGS Systems, a consulting firm based in
Massachusetts that specializes in computer system performance
evaluation. BGS was requested in 1982 to conduct a study using
analytical modeling tools to predict the performance of a high-throughput
(approx. 11 transactions per second at peak load) database application
implemented on IBM's IMS system, for a client who wishes to remain
anonymous. This synopsis of the result of this study is based on a report
furnished to the author by BGS Systems <BGS>.

The study began with an analysis that quantified the effect of
competition between different IMS processing regions (i.e., transactions) for
certain highly used database records. In particular, it documented the
fact that, even though the computer system can theoretically allow a
much larger number of transactions to reside in the system, the
effective level of multiprogramming attainable by the current application
system is about 3, beyond which blocking due to concurrency control
would eliminate the benefit of a higher multiprogramming level. This
observation is captured in the following two figures extracted from the
report.

In Figure 2, predicted system response times in minutes are plotted
against throughput load of the system, assuming no concurrency control
interference existed. It can be seen that the response time increases very
slowly as throughput load increases, signifying that the computer system
resources such as CPU and I/O devices are not heavily utilized. In
contrast, Figure 3 presents the system performance after
introducing the concurrency control blocking effect into the current
application. The A-curve represents the response time of the same set of
transactions on the same computer system as shown in the previous
figure. However, the interference of the blocking effect from the
concurrency control facility of the DBMS produced a steep response time
curve this time, and, since an average response time in excess of 2
minutes is considered unacceptable by the client, the graph clearly shows
that the current system cannot perform satisfactorily. Curves B
through F document predicted performance under various improvements
made to the computer system, including such measures as introducing
faster disks and better arrangement of data on the I/O devices.
However, it was concluded that, while with these tunings the current
system might produce adequate results, it probably would not be able to
handle further increases in load that are expected to take place. The
study recommended a redesign of the application system to better handle
the concurrency control issue.

Figure 2. System Response Time Assuming No Blocking Existed

Figure 3. System Response Time With Blocking Effect Included

1.2.2.5 SUMMARY

The above two cases are included in this chapter to provide empirical
evidence that the concept of system limitation due to database
concurrency control is an important one and is expected to gain attention as
databases are more widely installed and loads on these systems increase.


1.2.3 AN EXAMPLE ILLUSTRATING THE SCOPE OF CURRENT RESEARCH


Consider an inventory database application of a retail business with
a database shown in Figure 4. There are several types of transactions
that operate on this database. A type 1 transaction inserts a sales
record into the database when a sale occurs. A type 2
transaction inserts a stock-arrival record into the database when a
stock arrival occurs. A type 3 transaction is generated
periodically for each item in the inventory to compute the current
inventory level of that item. This transaction visits the sales and
stock-arrival records to compute the net change since the last
computation of the inventory level of that item. A new inventory level for
that item is then posted in the inventory record.

Figure 4. An example inventory database.

To control these transactions using two-phase locking, a type 3
transaction would have to set read locks on every read access it
generates to the sales and stock-arrival records, including records serving
as access paths such as index records. Using the timestamp ordering
approach, the type 3 transaction would have to leave a read timestamp on every
such record it reads. What we are interested in here is whether
there is a concurrency control method that would eliminate the need for
leaving read timestamps (or setting read locks) when a type 3
transaction accesses the sales and the stock-arrival records. An analysis
shows that, because those transactions that update the sales and the
stock-arrival records do not access the portion of the database a type 3
transaction would update, the above objective can be achieved as
follows: if one could produce a snapshot of the sales and the
stock-arrival records for a type 3 transaction, say, t, when t is
initiated, and have t operate on the snapshot rather than the original
records, then, without compromising serializability, t would not have to
set read locks or leave read timestamps on these records. This means
that concurrent updaters of these records (i.e., type 1 and type 2
transactions) can proceed without interference from t.
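
The snapshot idea can be illustrated with a small sketch. It is not the mechanism developed in this thesis. It assumes, for illustration only, that sales and stock-arrival records are insert-only and carry the logical time at which they were committed, so that filtering by the initiation time of t yields a snapshot. All class, field and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryDB:
    clock: int = 0
    sales: list = field(default_factory=list)     # (commit_ts, item, qty)
    arrivals: list = field(default_factory=list)  # (commit_ts, item, qty)
    level: dict = field(default_factory=dict)     # item -> (as_of_ts, qty)

    def _tick(self):
        self.clock += 1
        return self.clock

    def record_sale(self, item, qty):             # type 1 transaction
        self.sales.append((self._tick(), item, qty))

    def record_arrival(self, item, qty):          # type 2 transaction
        self.arrivals.append((self._tick(), item, qty))

    def post_level(self, item, start_ts):         # type 3 transaction
        as_of, qty = self.level.get(item, (0, 0))

        def net(records):
            # Read only what was committed after the last posting and no
            # later than this transaction's initiation; no lock or read
            # timestamp is left on the records.
            return sum(q for ts, i, q in records
                       if i == item and as_of < ts <= start_ts)

        self.level[item] = (start_ts, qty + net(self.arrivals) - net(self.sales))


db = InventoryDB()
db.record_arrival("widget", 10)
db.record_sale("widget", 3)
start = db.clock                  # type 3 transaction t is initiated here
db.record_sale("widget", 2)       # a concurrent type 1 update proceeds freely
db.post_level("widget", start)    # t sees only the snapshot as of 'start'
first_posting = db.level["widget"]
db.post_level("widget", db.clock) # a later type 3 run picks up the new sale
second_posting = db.level["widget"]
```

In this sketch the sale recorded after t's initiation does not delay t and is not seen by it; the next type 3 run accounts for it.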

Let us introduce yet another type of transaction. Suppose there are
type 4 transactions which are also generated periodically to check for
the need to reorder certain stock items. This type of transaction
reads the stock-arrival records and adjusts the stock-on-order records
(by setting the arrival-date field of such records). It then computes a
gross inventory level by summing the current-inventory-level and the
quantities indicated by the non-arrived stock-on-order records. Based
on this gross inventory level a decision is made as to whether to
reorder this item. If it decides to do so, an order request is printed and
a stock-on-order record is generated and inserted.

It is noted that a type 4 transaction reads, but does not update, the
stock-arrival and the inventory records. Therefore it would be desirable to
generalize the above observation on the type 3 transaction to
apply to the type 4 transactions, such that the read accesses to the
stock-arrival and inventory records by a type 4 transaction can proceed
without inducing interference with the other three types of
transactions.

Figure 5. The Structure of the Example Application


This example depicts a database system whose applications are
structured in a special way as shown in Figure 5. In the figure it is shown
that a transaction type can be paired with a particular portion of the
database it updates and has read accesses (dotted arrows) to other
portions of the database. In fact, the above case can be generalized
further by, for example, adding to it transaction types that read from
the reordering records and the stock-arrival records to generate
supplier profile records in the database. As the list goes on, a hierarchy of
transaction types is formed such that at each level of the hierarchy a
transaction only reads from, but does not (or only rarely does) write into, data
written by transaction types of an earlier level. This structure
presents a definite opportunity for concurrency control algorithms to
exploit.
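
Whether a set of transaction types forms such a hierarchy can be checked by treating "reads from" as a dependency between database portions and testing for cycles. The sketch below is an illustration only, not the decomposition methodology of this thesis. The segment names are hypothetical labels for the record types in the example, and the code assumes Python 3.9 or later for the graphlib module.

```python
from graphlib import CycleError, TopologicalSorter

# transaction type -> (segment it updates, segments it only reads)
ACCESS = {
    "type1": ("sales", []),
    "type2": ("stock_arrival", []),
    "type3": ("inventory", ["sales", "stock_arrival"]),
    "type4": ("stock_on_order", ["stock_arrival", "inventory"]),
}


def hierarchy_levels(access):
    """Return segments grouped by level, or None if the reads form a cycle."""
    sorter = TopologicalSorter()
    for updated, reads in access.values():
        # A segment sits above every segment its updating transaction reads.
        sorter.add(updated, *reads)
    try:
        sorter.prepare()
    except CycleError:
        return None
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = hierarchy_levels(ACCESS)
```

For the four transaction types described above, the sales and stock-arrival records form the lowest level, the inventory records the next, and the stock-on-order records the highest.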


1.2.4  RESEARCH  GOALS  AND  ACCOMPLISHMENTS 


Most concurrency control algorithms, for simplicity, choose to ignore
the above opportunity for reducing synchronization overhead, and
comply strictly with the assumption that all transactions may read and
write any part of the database and therefore every access has to be
controlled. The goal of the current research is to demonstrate that a
systematic exploitation of the read-write pattern of transactions can be
effectively used to achieve the objective of reducing concurrency
control overhead.

In this thesis, we will demonstrate the potential benefit of incorporating into a concurrency control algorithm the knowledge of special structures of the applications. Specifically, we accomplish the following tasks:

(1) Formalize the concept of 'hierarchically organized database applications' and identify types and properties of such organizations.

(2) Develop a timestamp-based concurrency control algorithm, called the Hierarchical Timestamp Algorithm (HTS), that is capable of taking advantage of the existence of such organizations in database systems. Specifically, we achieve the goal of allowing certain read accesses requested by update transactions to proceed without ever having to set read timestamps and without creating the danger of causing other concurrent transactions to abort. In other words, the algorithm allows certain update transactions to proceed in a way that does not interfere at all with certain other concurrent update transactions. In addition, a protocol for synchronizing read-only transactions in a hierarchically organized database system is developed, which allows a read-only transaction to proceed in a way that does not interfere at all with any concurrent update transaction.

The algorithm, consisting of a total of three types of protocols for synchronizing different types of accesses from update transactions, can be considered a generalized timestamp algorithm parameterized by a hierarchical structure of the applications. This algorithm degenerates to a conventional timestamp algorithm (therefore consisting of only one relevant protocol) when such a hierarchical structure is not recognized.

(3) Propose an implementation scheme for the hierarchical timestamping algorithm which demonstrates the benefit of the above algorithm in concrete terms. In designing the implementation, a methodology is also developed for analyzing and laying out system modules and shared data structures where concurrency is of concern. This methodology is believed to have a bearing on concurrent system design in general.

(4)  Formulate  as  an  optimization  problem  the  task  of  recognizing  a 
favorable  hierarchical  structure  given  a  database  application 
system.  The  problem,  unfortunately,  is  proven  to  be  NP-hard, 
and  a  heuristic  algorithm  is  proposed  for  solving  it. 

(5) Apply the concept of the hierarchical database organization approach to concurrency control to three different problem areas and demonstrate the efficacy of incorporating this concept into solving the concurrency control problems in database management systems.


It is important to stress that the thrust of the thesis research is more than proposing a new concurrency control algorithm. It demonstrates the potential benefit of exploiting the knowledge and the structure of the application systems to implement more efficient and more tailored concurrency control mechanisms. It also points out an area of research that has not been fully examined, namely, how to design transactions so that concurrency control tasks can be simplified without compromising the managerial requirements on the data. It is believed that transaction design with the concurrency control problems in mind could produce a structure of applications for which concurrency control is easier and less costly to implement.

1.3  THE  STRUCTURE  OF  THE  THESIS 

The thesis is composed of ten chapters. In the first chapter, the problem of database concurrency control is introduced and the motivation and the scope of the thesis research are discussed. The chapter also summarizes the goals and the accomplishments of this thesis research.

In Chapter two, a literature overview is presented to bring out relevant recent research developments in the database concurrency control area. It provides a perspective as to how this thesis research is related to recent trends in concurrency control research. This chapter also contains a review of important formalisms involved in the concurrency control problem so as to provide the background for understanding the proofs to be presented in the subsequent chapters.

In Chapter three, the concept of the hierarchical organization of database applications is formalized. The chapter classifies the tasks of algorithm development in such organizations into three: protocols for controlling update transactions under a non-cyclic database partition, protocols for controlling update transactions under a cyclic database partition, and a protocol for controlling read-only transactions where update transactions obey the above protocols. These three tasks are described separately in Chapters four to six. In each of these chapters, the basic definitions and the protocols are first described with illustrative examples. The proofs of correctness then follow. These protocols collectively form the hierarchical timestamp concurrency control algorithm.

In Chapter seven, a scheme is proposed for implementing the hierarchical timestamp protocols developed in the previous chapters. This scheme addresses the problems of timestamp computation and management, multi-version database implementation, and garbage collection.

In Chapter eight, we examine the problem of how to identify a favorable hierarchical database organization that can be used by the hierarchical timestamp algorithm. The complexity of this problem is analyzed and a heuristic algorithm for solving it is proposed.

In  Chapter  nine,  we  discuss  the  efficacy  issue  of  the  hierarchical 
timestamp  approach  to  database  concurrency  control.  Three  problem  areas 
(the  B-tree  access  method,  a  banking  application  and  a  database  computer 
application)  are  explored. 

Finally,  in  Chapter  ten,  we  summarize  the  thesis  and  discuss  future 
research  directions. 


2.0 LITERATURE REVIEW


This chapter is composed of two parts. The first section is an overview of recent work and trends in concurrency control research, providing a perspective as to how the work reported in this thesis is related to other work in the field. The second section reviews basic concepts and formalisms in concurrency control theory to provide a background for the theoretical development to follow in the subsequent chapters.

2.1 OVERVIEW OF RELEVANT RECENT RESEARCH IN DATABASE CONCURRENCY CONTROL

Concurrency control in a centralized or a distributed database system has been an active research area. The concept of database consistency has been formally analyzed in <Eswaran76, Gray76>, in which set-theoretic notions are used to formulate the concept. A consistent schedule of concurrent transactions has been defined to be one which is equivalent to a serialized schedule. Two-phase locking has been proposed in these papers as a mechanism which preserves serializability. This notion of serializability has been further explored in <Bernstein79> and <Papadimitriou79>.

2.1.1 ALGORITHMS, METHODS AND OTHER ISSUES

Algorithms for database concurrency control abound in the literature. Distributed databases present a challenge for the consistency problem which has encouraged the development of many new algorithms (for example, <Ellis77, Lamport78, Rosenkrantz78, Thomas79, Bernstein81>). A survey and comparison of theories and algorithms of concurrency control can be found in <Bernstein81>.

Most algorithms are considered variations, extensions and/or combinations of the two basic techniques for concurrency control: two-phase locking and timestamp ordering. The two-phase locking algorithm ensures consistency by imposing a partial order on all transactions based on their lock points. (A lock point of a transaction is the point in time when the locking phase of the transaction reaches its peak.) The timestamp ordering algorithm, on the other hand, ensures consistency by imposing a partial order on all transactions based on the initiation times of the transactions. The latter is sometimes referred to as the optimistic concurrency control method <Ullman82> because it allows transactions to proceed without locking the data elements, in the hope that conflicts will not occur. It does, however, mark each data element with the timestamp assigned to the transaction that accesses it (i.e., usually the initiation time of the transaction) in order to detect conflicts when they do occur and resolve them through transaction aborts. These two techniques, two-phase locking and timestamp ordering, have been used as the basis for further explorations.
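
The timestamping of data elements described above can be illustrated with a minimal sketch of the textbook form of basic timestamp ordering. It is not taken from any of the cited papers; the names `Element`, `Abort`, `read` and `write` are hypothetical, and transaction timestamps are assumed to be positive integers assigned at initiation time.

```python
from dataclasses import dataclass


class Abort(Exception):
    """Raised when a request arrives too late in timestamp order."""


@dataclass
class Element:
    """Bookkeeping kept with one data element (hypothetical layout)."""
    read_ts: int = 0   # largest timestamp of any transaction that read it
    write_ts: int = 0  # timestamp of the transaction that last wrote it


def read(elem: Element, ts: int) -> None:
    # A read is rejected if a later-stamped transaction has already written.
    if ts < elem.write_ts:
        raise Abort(f"read at ts={ts} follows write at ts={elem.write_ts}")
    # Every accepted read leaves a read timestamp on the element.
    elem.read_ts = max(elem.read_ts, ts)


def write(elem: Element, ts: int) -> None:
    # A write is rejected if a later-stamped transaction has read or written.
    if ts < elem.read_ts or ts < elem.write_ts:
        raise Abort(f"write at ts={ts} arrives too late")
    elem.write_ts = ts
```

In this sketch every accepted read updates `read_ts`, and a later write with a smaller timestamp is aborted because of it. This is the kind of read-timestamp bookkeeping whose cost the present research seeks to reduce.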

Another optimistic alternative to two-phase locking for concurrency control is the validation method <Kung81, Ceri82>. Like the timestamp algorithm, the validation method allows the transaction to proceed relatively freely without setting locks on the data elements accessed. Unlike the timestamping method, however, the algorithm does nothing (i.e., it does not even timestamp data elements) during the execution of the transactions. Instead, at the end of the transaction execution, the transaction enters a 'validation phase' during which the set of data elements the transaction has accessed is intersected with the access sets of all other concurrently committed transactions. The transaction is validated if the intersection is empty, and is aborted if it is not. Obviously, this method rests on the assumption that conflicts rarely occur, and that therefore virtually all transactions would pass the validation phase. Other variations of the validation method have also been discussed <Bernstein82b>.
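
The validation phase, as summarized above, amounts to a set-intersection test. The following sketch follows that simplified description only; it does not reproduce the published validation algorithms, which distinguish read sets from write sets. The function name `validate` and its arguments are hypothetical.

```python
def validate(access_set, concurrent_committed):
    """Return True if the finishing transaction may commit.

    access_set           -- set of data elements the transaction accessed
    concurrent_committed -- access sets of transactions that committed
                            while this transaction was executing
    """
    # The transaction passes only if its access set is disjoint from
    # every access set of a concurrently committed transaction.
    return all(access_set.isdisjoint(other) for other in concurrent_committed)
```

For example, `validate({"a", "b"}, [{"c"}, {"d"}])` accepts the transaction, while `validate({"a"}, [{"a", "c"}])` rejects it and the transaction would be aborted.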

Other issues concerning concurrency control algorithms in general are data granularity and crash recovery. In <Gray75>, <Korth82> and <Carey83> the issues involved in granularity hierarchies are discussed, while in <Ries77> and <Ries79> the effect of the choice of data granules on the performance of the concurrency control algorithms is addressed. In <Gray78>, the two-phase commit policy is discussed, which addresses the issue of handling a system crash during a transaction's write phase. In order not to have the effects of such transactions cascade in an undesirable fashion, the two-phase commit policy is used to effect 'atomic commit', which stipulates that in-place database updates should not take place until a transaction has forced all its updates to a secured log.

2.1.2  MULTI-VERSION  METHODS 


One recent development in concurrency control algorithms centers around the identification of techniques that increase the level of concurrency and/or reduce synchronization overhead while preserving the correctness of the algorithm. One approach is the use of a multi-version database. It has been observed that keeping multiple versions of database elements will improve the concurrency of the database <Papadimitriou82>. In Papadimitriou's paper, it is shown that there exists an infinite hierarchy of multi-version serializability, and proven that the more versions a DBMS keeps, the higher the level of concurrency it may achieve.

The concept of a timestamp-based multi-version database system was first proposed in <Reed78>. It is a general scheme in which the identifier of a data element consists of two components: the name of the data and the version of the data. In Reed's scheme, retrieval of an arbitrary time slice of the database is allowed.


A more limited multi-version concept was developed in <Bayer80>. In this scheme, one previous version of a data element, which has been saved for recovery purposes while the data element is going through changes made by uncommitted transactions, is utilized to allow read accesses to proceed without having to wait for the commitment of the update transaction. However, read locks still have to be set by the read requests, and there is the additional overhead of explicitly maintaining a transaction dependency graph. An extension of this method to a distributed database is proposed in <Bayer80b>. In <Viemont82>, an interesting method for concurrency control is devised which also makes use of this one extra copy of data elements to synchronize transactions by order of commit time. In essence, this technique blends timestamp ordering and two-phase locking into one and switches to one or the other at the most opportune time so as to increase the level of concurrency. In <Stearns81, Chan82> the one-previous-version method was extended to accommodate multiple previous versions (but without allowing a user to access an arbitrary time slice of the database). Stearns' method extends the method proposed in <Bayer80> to exploit the benefit of the existence of multiple previous versions. Read locks, however, are still needed for all transactions. Chan's method is also based on two-phase locking but gives read-only transactions special treatment: they do not have to set read locks. An extension of this method to a distributed database is presented in <Chan82b>. In <DuBourdieu82> a method similar to Chan's is also discussed. In <Garcia-Molina82>, a framework of strategies for processing read-only transactions is presented.

The  above  list  of  research  bears  resemblance  to  the  research  to  be 
reported  here.  However,  our  technique  is  one  which  is  timestamp  based 
and  strives  to  reduce  the  need  for  leaving  read  timestamps  for  not  just 
read-only  transactions,  but  update  transactions  as  well. 


2.1.3 EXPLOITING [...]


Another approach to reducing synchronization overhead is to use the method of transaction analysis to obtain a priori knowledge of the nature of the transactions to be run in the system. Two notable examples are conflict analysis <Bernstein80b> and the hierarchical locking protocol <Silberschatz80, Kedem80>.


In the research on concurrency control for SDD-1, conflict analysis was proposed, which exploits a priori knowledge of the nature of the transactions to be run in the system. (To some extent, some of the research listed in the previous paragraphs concerning special protocols for read-only transactions falls into this category too, as it exploits knowledge, albeit limited, of the nature of the transactions, namely, whether they are read-only or not.) The basic idea behind the SDD-1 approach is 'synchronizing only when necessary'. It synchronizes a transaction in one class with only those transactions in classes that are in conflict with that transaction. If only two transactions are run concurrently and they belong to classes that, via transaction analysis performed before run time, do not conflict, then there is no need to incur control overhead. However, if two transactions are in conflict, then timestamps plus intra-class serialized pipelining are used to ensure consistency.

The approach reported in the present paper is different from that of SDD-1 because it is not oriented towards distributed database systems and because, owing to the special structure of applications that our approach exploits together with the fact that the multiple version technique is employed, the protocols are much less restricted. These new protocols are therefore more practical to implement.

The hierarchical locking protocol, sometimes referred to as the tree protocol, also exploits special knowledge of the nature of the transactions. Specifically, it assumes a priori knowledge that the sequence of accesses to the data elements exhibits a tree pattern. For example, if data elements x and y in the database are accessed by a transaction only if the transaction has previously accessed another data element z, then one may consider z the 'parent' of x and y, and the access path that a transaction follows is always from the parent data elements to the child data elements. In general, if all data elements in the database can be structured as a tree and the access paths to these elements followed by transactions are always congruent to the tree structure, then a non-two-phase locking protocol can be used to synchronize the accesses. The tree protocol is correct, but it is not two-phase and is capable of allowing a higher level of concurrency than the two-phase locking algorithm. Generalizing the tree protocol to DAGs (directed acyclic graphs) has also been attempted. This concept of exploiting structural knowledge of the database applications to identify better concurrency control methods is further discussed in <Kung79>, where the notion of 'optimality of concurrency control algorithm' is explored.
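
To make the tree protocol concrete, the following sketch checks whether the lock and unlock requests of a single transaction obey the rules of the tree protocol as it is commonly stated: the first lock may be placed on any element, a later lock requires that the parent of the element is currently held, and an element may not be locked twice. This formulation, the function name `obeys_tree_protocol`, and the encoding of the tree as a child-to-parent map are assumptions made for illustration and are not taken from the thesis.

```python
def obeys_tree_protocol(parent, ops):
    """Check one transaction's lock/unlock sequence against the tree protocol.

    parent -- dict mapping each non-root element to its parent element
    ops    -- list of ("lock", element) or ("unlock", element) pairs
    """
    held = set()         # elements currently locked by the transaction
    ever_locked = set()  # elements locked at any time so far
    for action, item in ops:
        if action == "lock":
            if item in ever_locked:
                return False  # an element may be locked at most once
            # After the first lock, the parent must currently be held.
            if ever_locked and parent.get(item) not in held:
                return False
            held.add(item)
            ever_locked.add(item)
        elif action == "unlock":
            if item not in held:
                return False  # cannot release a lock that is not held
            held.remove(item)
        else:
            raise ValueError(f"unknown action: {action}")
    return True
```

With `parent = {"x": "z", "y": "z", "u": "x"}`, the sequence lock z, lock x, unlock z, lock u, unlock x, unlock u is accepted even though a lock is acquired after another has been released, which shows that the protocol is not two-phase.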

While the research reported in this thesis also uses acyclic graphs as tools for analyzing transactions and organizing groups of data elements (hence the term 'Hierarchical Timestamp Algorithm'), it bears little relationship to the hierarchical locking protocol developed by Kedem and Silberschatz. The hierarchical locking protocol is concerned with the knowledge of the natural sequence of the data requests from the transactions (e.g., first z, then x and y), and attempts to achieve early releases of locked elements by exploiting this knowledge. The HTS algorithm is not concerned with the sequence of accesses but with the modes of accesses from groups of similar transactions (e.g., always read z but never write z), and exploits this knowledge to achieve the effect of not having to leave traces for reading data elements. Therefore these two approaches, while sharing the philosophy of 'knowledge exploitation' and the tool of acyclic graphs, are very different in substance.

In the following section, we will review the basic formalisms in database concurrency control and pave the way for formally developing the hierarchical timestamp algorithm in the subsequent chapters.


2.2  BASIC  CONCEPTS  OF  MULTI-VERSION  CONSISTENCY 


In this section, we review the basic concepts and formalisms in database concurrency control. The concurrency control theory centers around the notion of 'correctness' of schedules. A schedule is a sequence of steps such as read, write or read-and-write of data elements in the database. An input schedule is a sequence of steps representing requests for actions on data elements submitted to the concurrency control (CC) facility by the transactions. An output schedule is a sequence of steps allowed by the concurrency control facility, which is basically a permutation of the input schedule. The purpose of the CC facility is to take an arbitrarily interleaved input schedule and produce an output schedule containing the same steps but in a sequence that is correct.

2.2.1 TRANSACTION PROCESSING MODELS

To prove that a concurrency control algorithm is correct, it is necessary that a precise definition of correct schedules be given. As discussed before, serializability is a common criterion, and is defined as follows:

Definition. A schedule S is serializable if there exists an equivalent schedule Ss in which all transactions are serialized (i.e., no steps of one transaction are interleaved with steps from another transaction).

Now we must give the precise meaning of 'schedule equivalence'. Intuitively, two schedules of a set of transactions are equivalent if their 'effects' on the database are equivalent. Since it is only realistic to assume that the CC facility is blind to the intent of a transaction's use of a data element, the common notions of equivalence can only be based on the assumption that every data element retrieved by a transaction would have an impact on data elements later written by the transaction, and that every write changes the value of the data element.

However, the above consensus does not eliminate the different ways in which the meaning of a 'step' in a schedule is interpreted. As a result, there are various transaction processing models, each with its own notion of a 'step' in a schedule. The simplest model is the action model, in which it is assumed that every step in a schedule is both a read and a write of the requested data element <Stearns76>. A more general model differentiates a read-only step from a read-write step, and is sometimes called a conflict model <Bernstein79>. The most general model is the read-only/write-only model, which does away with the assumption implied in the conflict model that a write must entail a previous read <Papadimitriou79>. The notion of schedule equivalence therefore must depend on the underlying assumption about what each step in a schedule means.

In addition to the differences due to transaction processing models, schedule equivalence also depends on whether it is meant to be 'final-state equivalence' <Papadimitriou79> or 'view equivalence' <Rosenkrantz82>. The former stipulates that two schedules are equivalent if the final database states are equivalent. Under this assumption, those transactions that do not generate any effect in the final database state (e.g., read-only transactions) can be considered redundant, or 'dead', and would not in any way affect the decision as to whether the schedule is serializable. The latter is concerned with whether every transaction sees the same database state in the two schedules rather than whether the final database state is the same.

The differences in defining schedule equivalence result in different technical definitions for serializability, which in turn result in different algorithms for testing for serializability. For example, testing serializability of a schedule under the action model amounts to detecting cycles in a simple directed graph, for which a polynomial time algorithm exists, while testing it under the read-only/write-only model amounts to detecting cycles in a polygraph, which is NP-complete.
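
The polynomial-time test under the action model can be sketched as follows. Since every step is treated as both a read and a write, any two steps on the same data element by different transactions conflict, so an arc is drawn from the earlier transaction to the later one and the schedule is serializable exactly when the resulting graph has no cycle. The representation of a schedule as a list of (transaction, element) pairs and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def precedence_arcs(schedule):
    """Arcs (earlier, later) between transactions touching the same element."""
    arcs = set()
    last = {}  # element -> transaction that accessed it most recently
    for txn, elem in schedule:
        prev = last.get(elem)
        # Linking consecutive accessors suffices: earlier accessors
        # remain connected to later ones through a path of such arcs.
        if prev is not None and prev != txn:
            arcs.add((prev, txn))
        last[elem] = txn
    return arcs


def is_acyclic(nodes, arcs):
    """Kahn's algorithm: repeatedly remove nodes with no incoming arcs."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for a, b in arcs:
        successors[a].append(b)
        indegree[b] += 1
    ready = [n for n in indegree if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return removed == len(indegree)  # all nodes removed means no cycle


def is_serializable_action_model(schedule):
    nodes = {txn for txn, _ in schedule}
    return is_acyclic(nodes, precedence_arcs(schedule))
```

For instance, the schedule T1:x, T2:x, T2:y, T1:y is rejected because T1 precedes T2 on x while T2 precedes T1 on y, whereas T1:x, T2:x, T1:y, T2:y is accepted.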

In sum, to prove that a concurrency control mechanism is correct, one must state the assumptions underlying the transaction processing model and the precise meaning of schedule equivalence. With this background, we now present the basic concepts in multi-version serializability.

2.2.2 MULTI-VERSION SERIALIZABILITY


Two definitions for multi-version serializability have recently been proposed. Both definitions are based on the most general transaction processing model, namely, the read-only/write-only model, but differ in the definitions of the read steps in a schedule. The difference lies in whether the version number of the data element of a read step is specified.

In <Bernstein82>, a multi-version (mv) schedule is composed of read steps in the form of r_j(d^i) and write steps in the form of w_i(d^i), where r_j(d^i) is interpreted as 'transaction j reads the version i of data element d' and w_i(d^i) is interpreted as 'transaction i creates the version i of data element d.' Any feasible mv schedule is subject to the constraint that if r_j(d^i) is in the schedule, then w_i(d^i) must be before r_j(d^i) in the schedule. This constraint merely says that a version cannot be read until it has been created. Under this definition, the version that a read step is reading is determined in the schedule. To test for serializability of a mv schedule defined in this way, one must find out if there exists a version order, denoted as <<, such that the graph defined below is acyclic. (There are notational differences between the definition given here and that given in <Bernstein82>.)

Definition. A transaction dependency graph of a mv schedule S and a version order <<, denoted as TG(S,<<), is a digraph where the nodes are the transactions in S and the arcs, i -> j, representing 'transaction i depends on j', are assigned according to the following rules:

i -> j iff

(1) w_j(d^j) and r_i(d^j) are in S, or

(2) r_j(d^k) and w_i(d^i) are in S and k << i, or

(3) w_j(d^j) and r_k(d^i) are in S and j << i.

Intuitively, this means that a transaction i is said to depend on (i.e., must follow) transaction j if i reads a version of a data element created by j, or i writes over (i.e., creates a subsequent version of) a version of a data element read by j, or i writes over a version of a data element created by j and the version created by i is read by any transaction. Assuming a read-only/write-only data model, the following theorem is given in <Bernstein82>:

Theorem.  A  mv  schedule  S  is  serializable  iff  there  exists  a  version 
order  <<  such  that  TG(S,<<)  is  acyclic. 

It is shown that, under this definition, testing for serializability given a mv schedule is NP-complete. Intuitively, the complexity of the testing algorithm stems from the combinatorial explosion of the number of possible version orders that the algorithm must examine. However, if a version order is given, then the problem is reduced to one which tests for cycles in a simple graph, which is not a hard problem.

In <Papadimitriou82>, a mv schedule is defined to be composed of read steps in the form of r_j(d) and write steps in the form of w_i(d^i), where r_j(d) denotes 'transaction j reads an (unspecified) version of d' and w_i(d^i) has the usual meaning. In other words, the mv schedule in this definition retains the flexibility of assigning versions to be read, so long as the assignment satisfies the constraint that a version must be created before it is read. With this additional flexibility, testing for serializability of a schedule must also take into consideration all the possible ways of assigning versions to the read steps. Therefore, a mv schedule is serializable if and only if there exists an interpretation I of the schedule, i.e., an assignment of versions to read steps in S, and a version order <<, such that the transaction dependency graph TG(S,I,<<) is acyclic. Intuitively, the term 'an interpreted schedule' is equivalent to the term 'a schedule' in the previous definition. It is not difficult to see that testing for serializability of a schedule under this second definition is also NP-hard, while testing for serializability of an interpreted schedule given a version order is equivalent to testing for the existence of cycles in a simple directed graph.

It can be concluded that, while testing for serializability of a multi-version schedule is a hard problem, it is quite straightforward if the mv schedule is interpreted (i.e., version numbers have been assigned in the read and write steps) and a version order is determined. For our purpose, fortunately, the schedules that must be tested for serializability, in order to prove the correctness of the protocols, are interpreted, and the specific version orders are given. We summarize the above discussion by giving the following definitions and a theorem, which together establish how to prove the correctness of the multi-version concurrency control algorithm to be developed in the subsequent chapters.

Definition. A schedule S of a set of transactions T is a sequence of steps, each of which is denoted as a tuple of the form a_i(d^j). a_i refers to an 'action' on behalf of a transaction t_i. The action can be read (r) or write (w). d^j is a version of a data granule, where d indicates the data granule and j indicates the version. If the action is write, then the version of the data granule included in the step is created by the transaction. If the action is read, then the transaction reads the version of the data granule indicated in the tuple.

An example of a schedule is w_1(d^i), r_2(d^i), w_2(d^j), r_3(d^j).


Definition. Assume that a version order, denoted as <<, is given. A
version j of a data element d is the predecessor of a version k of d if
j << k and there exists no version i of d such that j << i << k.

Definition. A transaction dependency graph of a schedule S is a
digraph, denoted as TG(S), where the nodes are the transactions in S and
the arcs, representing direct dependencies between transactions, exist
according to the following rules:

t_2 -> t_1 ∈ A (the set of arcs) iff

(1) w_1(d^j) and r_2(d^j) are in S for some d^j, or

(2) r_1(d^j) and w_2(d^k) are in S for some d^j, d^k where d^j is the
predecessor of d^k, or

(3) w_1(d^j), w_2(d^k) and r_3(d^k) are in S and d^j is the predecessor
of d^k.

In other words, the transaction dependency graph represents a
relation -> (i.e., 'depends on', or 'follows') of transactions such that
t_2 -> t_1 if t_2 reads a version of a data granule created by t_1, or if
t_2 creates a version of a data granule whose predecessor has been read by
t_1, or if t_2 creates a version of a data granule whose predecessor was
created by t_1 and the version created by t_2 is eventually read by some
transaction in S.
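The three rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the report's implementation: the step encoding (action, transaction, granule, version), the function names, and the representation of the version order as an oldest-first list per granule are all hypothetical. Dependencies of a transaction on itself are skipped, which is also an assumption of this sketch.

```python
def dependency_arcs(schedule, version_order):
    """Return arcs (t2, t1), meaning t2 -> t1 (t2 depends on t1), per rules (1)-(3).

    schedule: list of (action, txn, granule, version), action in {"r", "w"}.
    version_order: dict granule -> list of versions, oldest first (assumed given).
    """
    writer = {(g, v): t for a, t, g, v in schedule if a == "w"}
    readers = {}
    for a, t, g, v in schedule:
        if a == "r":
            readers.setdefault((g, v), set()).add(t)
    arcs = set()
    # Rule (1): t2 reads a version created by t1.
    for key, rs in readers.items():
        if key in writer:
            arcs |= {(t2, writer[key]) for t2 in rs if t2 != writer[key]}
    for g, versions in version_order.items():
        for pred, succ in zip(versions, versions[1:]):
            t2 = writer.get((g, succ))
            if t2 is None:
                continue
            # Rule (2): t2 creates the successor of a version read by t1.
            arcs |= {(t2, t1) for t1 in readers.get((g, pred), ()) if t1 != t2}
            # Rule (3): t2 creates the successor of t1's version, and some
            # transaction reads the version created by t2.
            t1 = writer.get((g, pred))
            if t1 is not None and t1 != t2 and readers.get((g, succ)):
                arcs.add((t2, t1))
    return arcs


def is_acyclic(arcs):
    """Depth-first search for a cycle in the digraph given by the arcs."""
    graph = {}
    for a, b in arcs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "visiting" or "done"

    def visit(n):
        if state.get(n) == "done":
            return True
        if state.get(n) == "visiting":
            return False  # back edge: a cycle exists
        state[n] = "visiting"
        ok = all(visit(m) for m in graph[n])
        state[n] = "done"
        return ok

    return all(visit(n) for n in graph)


# The example schedule w_1(d^1), r_2(d^1), w_2(d^2), r_3(d^2).
schedule = [("w", 1, "d", 1), ("r", 2, "d", 1), ("w", 2, "d", 2), ("r", 3, "d", 2)]
arcs = dependency_arcs(schedule, {"d": [1, 2]})
```

For this schedule the sketch yields the arcs t_2 -> t_1 and t_3 -> t_2, and the graph is acyclic.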


The  following  theorem  follows  from  the  discussion  given  previously: 


3.0 HIERARCHICAL DATABASE DECOMPOSITION - DEFINITIONS AND PROPERTIES


As discussed in Chapter one, the hierarchical timestamp algorithm for
concurrency control is devised to exploit the opportunity that a set of
hierarchically organized applications of a database may present. A
hierarchy of applications was briefly described as one in which the
transactions belonging to applications at one level of the hierarchy
would only read from, but would not, or would rarely, write into data
produced by applications of an earlier (i.e., a higher) level.
Hierarchically organized applications may offer an opportunity to
optimize a concurrency control mechanism because those read accesses from
a transaction at one level to data produced by transactions at a higher
level may be synchronized using a special, less expensive protocol. In
this chapter, we will formalize this concept of 'a hierarchy of
applications' and set the stage for presenting the actual algorithms in
the subsequent chapters.

We capture the essence of a hierarchy of applications through the
definition of a data segment hierarchy. A data segment hierarchy,
constructed during the database installation phase (i.e., before run
time), is basically a partial order of a given set of data segments of the
database subject to additional graph-theoretic constraints and
constraints related to transaction accesses. In this chapter we will
discuss only what constitutes (i.e., the formal definition of) a data
segment hierarchy given a database segmentation. We will not, however,
address the issue of how the database segmentation scheme is arrived at
and how an optimal data segment hierarchy, if there exists more than one,
can be found. The latter issues will be addressed in a later chapter on
database decomposition methodology.


3.1 DATABASE PARTITION AND THE DATA SEGMENT GRAPH (DSG)

Let the database be partitioned into a number of data segments. We
will use the concept of a data segment graph (DSG), constructed by means
of transaction analysis, to reflect the accesses by database
transactions to the data segments in a database. As will be shown later,
the topology of the data segment graph will be used to analyze whether a
particular partial order of the data segments constitutes a data segment
hierarchy relevant to the hierarchical timestamp (HTS) concurrency
control scheme.

Informally, let a database be partitioned into data segments. A DSG
is a digraph with nodes corresponding to the data segments and arcs
constructed in such a way that there is an arc from a data segment D_i to
another data segment D_j if and only if one can find a potential
transaction in the database system that updates data elements in D_i and
accesses (i.e., reads or writes) data elements in D_j. In other words,
D_i -> D_j, i ≠ j, indicates that there exist transactions in the system
that would link updates of data elements in D_i to the content of data
elements in D_j.

Definition. Let T_u be a set of update transactions to be performed
on a database D. Let P be a partition of D into data segments
D_1, ..., D_n. A data segment graph of P with respect to T_u is a digraph
denoted as DSG(P,T_u) with nodes corresponding to the data segments of P
and a set of directed arcs joining these nodes such that, for i ≠ j,
D_i -> D_j iff there exists t ∈ T_u s.t. w(t) ∩ D_i ≠ empty and
a(t) ∩ D_j ≠ empty, where t is a transaction, and w(t), r(t) and a(t) are
the write set, the read set and the access set of transaction t. (The
access set a(t) is the union of r(t) and w(t).)
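As a minimal sketch of this definition, the following Python derives the arcs of a DSG from the read and write sets of the update transactions. The dictionary mapping data elements to segment names and the (read_set, write_set) encoding of a transaction are assumptions made for illustration only.

```python
def data_segment_graph(segment_of, update_txns):
    """Arcs (Di, Dj), i != j, such that some update transaction writes Di and accesses Dj.

    segment_of: dict data element -> segment name (the partition P).
    update_txns: iterable of (read_set, write_set) pairs (the set T_u).
    """
    arcs = set()
    for read_set, write_set in update_txns:
        written = {segment_of[d] for d in write_set}
        # The access set a(t) is the union of r(t) and w(t).
        accessed = written | {segment_of[d] for d in read_set}
        arcs |= {(di, dj) for di in written for dj in accessed if di != dj}
    return arcs


segment_of = {"x": "D1", "y": "D2", "z": "D3"}
txns = [
    ({"x"}, {"y"}),       # writes D2, reads D1
    ({"x", "y"}, {"z"}),  # writes D3, reads D1 and D2
]
dsg = data_segment_graph(segment_of, txns)
```

Here the first transaction contributes the arc D2 -> D1 and the second contributes D3 -> D1 and D3 -> D2. Read-only transactions contribute nothing, since they have an empty write set.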

Construction of a data segment graph given a set of data segments and
update transactions is illustrated in Figure 6. The data segment graph
is a tool for capturing the pattern of accesses among transactions to be
run in the system. Note that, in our algorithm, there is no need for
read-only transactions to participate in this transaction analysis,
eliminating the difficulties of pinning down, a priori, the nature of
all ad hoc queries.


3.2 THE DATA SEGMENT HIERARCHY (DSH)


[Figure 6. Illustration of a data segment graph DSG. Transaction
analysis, performed on the set of update transactions T_u in the system,
maps the data segments to the data segment graph DSG(P,T_u), where
D_i -> D_j iff (1) i ≠ j and (2) among the transactions that write into
D_i, there exists one that reads from or writes into D_j.]


In this section we will give the definition and the properties of the
data segment hierarchy. We first briefly introduce the concept of a
digraph called a semi-tree. This concept will then be used in the
definition of data segment hierarchy. Informally, a semi-tree is a digraph
such that, if the directions of the arcs in the graph are ignored, the
graph appears to be a spanning tree.

Definition.  A  semi-tree  is  a  digraph  such  that  there  exists  at  most 
one  undirected  path  between  any  pair  of  nodes  in  the  graph.  Every  arc 
in  a  semi-tree  is  called  a  critical  arc. 

An example of a semi-tree is shown in Figure 7. By definition, a
semi-tree has the property that there exists at most one directed path
between any pair of nodes. Now we give the definition of a data segment
hierarchy.

Definition. Given a data partition P and a data segment graph
DSG(P,T_u), a data segment hierarchy, denoted as DSH(P,T_u), is the
transitive reduction of any non-cyclic graph of nodes in DSG(P,T_u) such
that

(1) DSH(P,T_u) is a semi-tree, and

(2) If D_i -> D_j is contained in DSG(P,T_u) then either D_i -> D_j or
D_j -> D_i is contained in the transitive closure of DSH(P,T_u).

In other words, a data segment hierarchy for a given partition is a
partial order of the data segments such that between any pair of data
segments that are ordered in the partial order there exists only one
hierarchical path that leads one data segment to the other. In
addition, any pair of data segments that are connected via a directed
path in the data segment graph, defined in the previous section, must
also be ordered in this partial order. An illustration of a data
segment hierarchy is also shown in Figure 7.

[Figure 7. Illustration of a semi-tree and a data segment hierarchy.]
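The two conditions of the definition can be checked mechanically. The sketch below is an illustration under assumptions, not part of the report: arcs are given as sets of (from, to) pairs, and the function names are hypothetical. The semi-tree test uses a union-find structure, so an arc whose endpoints are already connected (ignoring direction) signals a second undirected path.

```python
def is_semi_tree(nodes, arcs):
    """True if the set of arcs, with directions ignored, contains no cycle."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in arcs:
        ra, rb = find(a), find(b)
        if ra == rb:  # a second undirected path between a and b
            return False
        parent[ra] = rb
    return True


def reachable(arcs, start):
    """Nodes reachable from start along one or more directed arcs."""
    succ = {}
    for a, b in arcs:
        succ.setdefault(a, set()).add(b)
    seen, stack = set(), [start]
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_data_segment_hierarchy(nodes, dsh_arcs, dsg_arcs):
    """Check condition (1), semi-tree, and condition (2), every DSG arc is ordered."""
    if not is_semi_tree(nodes, dsh_arcs):
        return False
    return all(
        dj in reachable(dsh_arcs, di) or di in reachable(dsh_arcs, dj)
        for di, dj in dsg_arcs
    )


nodes = ["D1", "D2", "D3"]
dsg = {("D2", "D1"), ("D3", "D1"), ("D3", "D2")}
chain = {("D3", "D2"), ("D2", "D1")}  # a total order of the segments
fork = {("D2", "D1"), ("D3", "D1")}   # leaves D2 and D3 unordered
```

With these inputs the chain passes both conditions, the fork fails condition (2) because the DSG arc D3 -> D2 is not ordered, and the DSG itself fails condition (1) because its three arcs form an undirected cycle.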

The reason why these restrictions are placed in the definition of a
data segment hierarchy, i.e., the reason why it is not simply the
transitive reduction of any non-cyclic graph, will be made clear after we
define the concept of transaction classification in the next sub-section.
Here we address the issue of the existence of a DSH given any database
partition.

There may be multiple data segment hierarchies that satisfy the above
definition given a database partition. However, since any total order
of the data segments is a semi-tree, the existence of a data segment
hierarchy given a database partition is always guaranteed, because any
total order that has DSH(P,T_u) as a subset can be used as a data segment
hierarchy.

Lemma 0.1. (Existence of DSH) Given a database partition P and a set
of update transactions T_u, there exists a data segment hierarchy
DSH(P,T_u).

In the remainder of this thesis, the notation DSH* refers to a
particular chosen data segment hierarchy. We will say that a data segment
D_i is higher than a data segment D_j, denoted as D_i +> D_j, in a data
segment hierarchy DSH*, if there exists a path in DSH* from D_j to D_i. In
general, we say that data segment D_i and data segment D_j are related if
either D_i = D_j or D_i and D_j are connected by a directed path. We will
also denote the path from D_i to D_j in DSH* as CP_i^j (CP standing for
Critical Path).


3.3  TRANSACTION  CLASSIFICATION 


As discussed before, the purpose of defining a data segment hierarchy
for a database is to formally capture the concept of a 'hierarchy of
applications'. Therefore the concept of a data segment hierarchy must
be paired with a scheme which assigns the transactions in the system to
levels in the data segment hierarchy. We accomplish
this by 'classifying' transactions based on the hierarchical positions
of the data segments that the transactions have to write into. We first
give the following property.

Property. If there exists a transaction t in T_u such that
D_i ∩ w(t) ≠ empty and D_j ∩ w(t) ≠ empty and D_i ≠ D_j, then D_i and D_j
are ordered in DSH(P,T_u).

Proof. Follows directly from the definition of the data segment
hierarchy.

That  is,  if  a  transaction  writes  into  more  than  one  data  segment, 
then  all  the  data  segments  that  it  writes  into  must  be  connected  by  a 
directed  path  in  the  data  segment  hierarchy. 

Given a data segment hierarchy DSH*, we will assign each update
transaction in T_u to the highest data segment it writes into. (The fact
that a highest data segment exists follows from the above property.)
Formally,

Definition. Given a DSH*, a transaction t is rooted in a data segment
D_i, denoted as t ∈ D_i, iff (1) w(t) ∩ D_i ≠ empty, and (2) if there
exists D_j ≠ D_i such that w(t) ∩ D_j ≠ empty, then D_i +> D_j in DSH*.
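The rooting rule can be sketched as follows. This is a hedged illustration: it assumes DSH* is given as a set of arcs pointing from a lower segment to a higher one (consistent with 'D_i is higher than D_j if there exists a path from D_j to D_i'), and the function names and the encoding of a transaction by the set of segments it writes are hypothetical.

```python
def higher_segments(dsh_arcs, segment):
    """Segments strictly higher than `segment`, i.e. reachable along lower -> higher arcs."""
    succ = {}
    for lower, higher in dsh_arcs:
        succ.setdefault(lower, set()).add(higher)
    seen, stack = set(), [segment]
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def root_segment(dsh_arcs, written_segments):
    """The written segment that is higher than every other written segment."""
    written = set(written_segments)
    for di in written:
        if all(di in higher_segments(dsh_arcs, dj) for dj in written - {di}):
            return di
    raise ValueError("written segments are not all ordered in the hierarchy")


def classify(dsh_arcs, write_segments_by_txn):
    """Group update transactions into classes keyed by their root segment."""
    classes = {}
    for txn, written in write_segments_by_txn.items():
        classes.setdefault(root_segment(dsh_arcs, written), set()).add(txn)
    return classes


dsh = {("D3", "D2"), ("D2", "D1")}  # D1 is the highest segment
classes = classify(dsh, {"t1": {"D1"}, "t2": {"D2", "D3"}, "t3": {"D3"}})
```

In this example t2 writes into both D2 and D3 and is rooted in D2, the higher of the two.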

As a digression, we can now discuss the motivation behind our
definition of DSH. The goal is to build a partial order of the data
segments such that we can define a 'rooting scheme' of the transactions
in which (1) if a transaction is rooted in D_i and has to access data
elements in D_j then D_i and D_j are ordered in the data segment hierarchy
(i.e., it is possible to determine whether an access is from a
transaction rooted in one data segment to a higher, lower or the same data
segment in the segment hierarchy), and (2) this ordering is established
through a unique path in the partial order. These are the basic
properties of a DSH that, as will be shown later, are relied upon by our
concurrency control algorithm. The following property formalizes the
above discussion:

Property. (The Basic Property of a Data Segment Hierarchy). Let
DSH* be a data segment hierarchy for a database partition P with a set
of update transactions T_u. Let t ∈ D_i in DSH*. Let d ∈ D_j be a data
element in the access set of t (i.e., d ∈ w(t) ∪ r(t)). Then either
D_i = D_j or there exists a unique path between D_i and D_j in DSH*.

Proof. If D_i ≠ D_j then by the definition of data segment graph
D_i -> D_j is contained in DSG(P,T_u). Therefore by the definition of data
segment hierarchy there must exist a path between D_i and D_j in DSH*.
Since DSH* is a semi-tree, by definition, the path between D_i and D_j
must be unique. Q.E.D.

For notational convenience, we also define a transaction
classification which gives a name to a set of transactions rooted in the
same data segment.

Definition. A transaction classification with respect to a data
segment hierarchy DSH* for a database partition P and a set of update
transactions T_u, is a partition of the set T_u into transaction classes
T_1, T_2, ..., T_n, such that a transaction t ∈ T_i iff t is rooted in data
segment D_i.

Therefore a transaction classification partitions the set of update
transactions into classes, each of which corresponds to a data segment
in the data partition, and the notation 't ∈ T_i' has the same meaning as
't ∈ D_i'. Consequently, we will call accesses from a transaction to
data elements in its own root data segment intra-class accesses, and
accesses from a transaction to data elements in data segments other
than its own root data segment inter-class accesses.


3.4  TYPES  OF  SYNCHRONIZATION  PROTOCOLS 

Based  on  the  above  definitions,  a  hierarchy  of  classes  of  update 
transactions  is  devised.  A  class  rooted  in  a  data  segment  may  access  a 
data  segment  that  is  higher  than  or  lower  than  the  root  data  segment  or 
access  data  within  its  own  root  data  segment.  Since  an  update  trans¬ 
action  is  always  rooted  in  the  highest  data  segment  that  it  writes  into, 
an  access  to  a  higher  data  segment  must  be  a  read  access,  while  the 
accesses  to  a  lower  or  the  root  data  segment  could  be  either  read  or 
write. The rationale behind the HTS algorithm is the belief that it is
possible to devise a synchronization protocol for accessing higher level
data segments that is 'cheaper' than those for accessing the root or the
lower data segments. The purpose of such a hierarchy, therefore, is to
make it possible to classify an update transaction's accesses to the
database into accesses to a higher, lower or the root data segment, and
to differentiate the synchronization protocols necessary for controlling
each type of access.

In  addition  to  the  update  transaction  synchronization  protocols,  a 
protocol  for  synchronizing  read-only  transactions  is  also  necessary. 
The  goal  is  to  enable  a  read-only  transaction  to  access  any  data  segment 
in  the  database  without  ever  having  to  set  read  timestamps  or  be  blocked 
or  be  aborted.  Recall  that  the  construction  of  a  data  segment  hierarchy 
does  not  take  into  consideration  the  read-only  transactions,  therefore 
there  is  not  a  concept  of  'rooting'  that  applies  to  the  read-only  trans¬ 
actions,  and  the  protocols  developed  for  the  update  transactions  may  not 
always  apply. 

In  summary,  we  have  identified  four  types  of  synchronization  proto¬ 
cols  each  of  which  would  apply  to  a  type  of  database  access: 

(1) Synchronization protocol for reading a data segment higher than
the root data segment of the transaction.

(2) Synchronization protocol for reading or writing the root data
segment of the transaction.

(3) Synchronization protocol for reading or writing a data segment
lower than the root data segment of the transaction.

(4) Synchronization protocol for reading any data segment by a
read-only transaction.


3.5 NON-CYCLIC VS. CYCLIC DATABASE PARTITION

In this section, we will present a type of database partition that
may result in a data segment hierarchy in which accesses to lower data
segments by an update transaction do not exist, and which therefore
requires only two out of the three protocols for synchronizing update
transactions listed in the previous section: the protocol for reading
higher data segments and the one for accessing the root data segment. As
will be explained later, this is the kind of database partition that is
most suited for applying the hierarchical timestamping algorithm.

Based on the topology of DSG(P,T_u), a partition P can be of the
cyclic type or the acyclic type as defined below.

Definition. A partition P of a database D into data segments is
acyclic (cyclic) with respect to T_u if DSG(P,T_u) is acyclic (cyclic).

We will give the properties of a non-cyclic database partition that
explain why accesses to a lower data segment from a transaction rooted in
a higher data segment do not exist.


Properties of Non-cyclic Database Partitions

(1) Given a database partition P and its data segment graph DSG(P,T_u)
where DSG(P,T_u) is non-cyclic, there exists a data segment
hierarchy DSH* such that the transitive closure of DSH* contains
DSG(P,T_u).

Proof. Since DSG(P,T_u) is acyclic, it defines a partial
order of the data segments. Any total order of the data segments
that contains this partial order is a data segment hierarchy.
Therefore there exists a data segment hierarchy DSH* such that
the transitive closure of DSH* contains DSG(P,T_u). Q.E.D.

We  will  refer  to  the  DSH  that  satisfies  the  above  condition 
for  a  non-cyclic  database  partition  as  a  natural  DSH  for  the 
partition.  We  will  assume  that  for  a  non-cyclic  partition  the 
data  segment  hierarchy  chosen  would  always  be  a  natural  DSH. 

(2) Let P be an acyclic database partition with respect to T_u. Then
every t ∈ T_u writes in one and only one data segment in P.

Proof. Suppose t writes in two distinct data segments D_i and
D_j. Then according to our rule of construction of the data
segment graph DSG(P,T_u), both D_i -> D_j and D_j -> D_i are in
DSG(P,T_u); therefore DSG(P,T_u) cannot be acyclic, which contradicts
the assumption. Q.E.D.

(3)  Let  DSH*  be  a  natural  DSH  for  a  non-cyclic  database  partition 
with  update  transaction  set  Tu.  Then  every  access  to  a  non-root 
data  segment  from  any  transaction  in  Tu  is  a  read  access  to  a 
higher  data  segment. 

Proof. Based on the definition of a data segment hierarchy,
every inter-class access to a higher data segment is a read
access. Therefore we need only to show that no access to a lower
data segment exists. Suppose, to the contrary, that some
transaction rooted in D_j needs to access some data element in D_i
and D_i is lower than D_j in DSH*. Then there exists D_j -> D_i in
DSG(P,T_u), while D_j -> D_i cannot be contained in the transitive
closure of DSH*, which contradicts the definition of a natural data
segment hierarchy. Q.E.D.

The  above  properties  establish  the  fact  that  only  two  types  of 
accesses  exist  for  a  non-cyclic  database  partition  that  has  chosen  a 
natural  data  segment  hierarchy:  those  to  the  root  data  segment  of  the 
transaction  or  those  (read  accesses)  to  data  segments  that  are  higher 
than  the  root  data  segment  of  the  transaction.  In  the  following  chapter 
the  concurrency  control  protocols  for  these  two  types  of  accesses  are 
developed  and  shown  to  be  correct. 
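The proof of property (1) is constructive: any total order that extends the partial order defined by an acyclic DSG is a natural DSH. A minimal sketch of that construction, using the Python standard library (graphlib, available from Python 3.9), is given below. The function name and the arc encoding (written segment, accessed segment) are assumptions of this sketch. A total order is only one possible natural DSH and is not necessarily the most useful one.

```python
from graphlib import CycleError, TopologicalSorter


def natural_dsh(nodes, dsg_arcs):
    """Return a chain of arcs (lower, higher) whose transitive closure contains
    the DSG, or None if the DSG is cyclic.

    dsg_arcs: set of (Di, Dj) meaning Di -> Dj in the DSG.
    """
    sorter = TopologicalSorter({n: set() for n in nodes})
    for src, dst in dsg_arcs:
        sorter.add(dst, src)  # src must come before dst in the order
    try:
        order = list(sorter.static_order())
    except CycleError:
        return None  # cyclic partition: no natural DSH exists
    # Consecutive segments in the total order form the chain.
    return list(zip(order, order[1:]))


nodes = ["D1", "D2", "D3"]
chain = natural_dsh(nodes, {("D2", "D1"), ("D3", "D1"), ("D3", "D2")})
```

For this acyclic DSG the only compatible total order is D3, D2, D1, so the chain is D3 -> D2 -> D1. For a cyclic DSG, such as one containing both D1 -> D2 and D2 -> D1, the function returns None.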

It is easy to see that for a cyclic database partition, the above
properties no longer hold. Therefore, to apply the hierarchical
timestamping algorithm to more general database partitions,
accesses from a transaction to lower data segments must be considered.
In Chapter five, we extend the theory provided in Chapter four and show
that such accesses can be controlled by a third protocol. Without
affecting the protocols developed in Chapter four, this set of three
protocols for synchronizing update transactions collectively ensures the
consistency of the database.

Finally, in Chapter six, we develop the protocol for synchronizing
read accesses from read-only transactions that are run in a database
whose update transactions are synchronized by the above three protocols.


4.0 SYNCHRONIZING UPDATE TRANSACTIONS UNDER NON-CYCLIC PARTITIONS


Given a non-cyclical database partition, the two types of database
accesses from an update transaction are the read and write accesses to
its root data segment (intra-class accesses) and read accesses to higher
data segments (inter-class accesses). The key to our concurrency
control technique is the recognition that, if a transaction t belongs to a
class T_i that writes data segment D_i and reads data segment D_j, and D_j
is higher than D_i in the data segment hierarchy, then this transaction
would appear to be a read-only transaction so far as D_j is concerned.
Therefore when a request to read a data element d in D_j is issued by t,
there may exist a proper committed version of d that can be given
to t without the need of leaving a read timestamp with d. However, the
way this proper version is computed must be such that overall
serializability is enforced. In other words, the introduction of a
transaction dependency of t on t', where t' is the transaction in class
T_j which created the version of d that t is allowed to read, must never
induce cycles in the transaction dependency graph as defined in Section
2.2. To this end, a function called the activity link function is
devised to compute the versions that inter-class read accesses may be
granted, and a theorem which establishes the correctness of this
computation is presented. Based on this theorem, a concurrency control
algorithm is also presented.


Notations.

(1) I(t) = the initiation time of a transaction t.

(2) C(t) = the commit time of a transaction t.

(3) TS(d^v) = the initiation time of the transaction that creates the
version v of a data granule d, i.e., the write timestamp of d^v.
(A data granule is the smallest unit that concerns the concurrency
control component of the database system, and is the smallest
unit of access so far as concurrency control is concerned.)


4.1  BASIC  DEFINITIONS 


The following definitions and properties assume a database that has a
non-cyclical partition with respect to T_u and a natural data
segment hierarchy DSH*.

Definition. A function I_i^old defined for a data segment D_i is a
function which maps a time m to another time m', i.e., m' = I_i^old(m),
such that m' is the initiation time of the oldest active (i.e.,
uncommitted and un-aborted) transaction rooted in the data segment D_i at
time m. Formally,

I_i^old(m) = m            if there exists no t ∈ T_i active at time m,
             Min(I(t))    otherwise, where t ∈ T_i, I(t) < m and C(t) > m.
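A minimal sketch of this function follows. It assumes, for illustration only, that the transactions of class T_i are available as (initiation time, commit time) pairs, that a still-running transaction has commit time math.inf, and that aborted transactions have been removed from the list.

```python
import math


def i_old(m, transactions):
    """I_i^old(m): initiation time of the oldest transaction of T_i active at time m,
    or m itself if no transaction of T_i is active at m.

    transactions: list of (initiation_time, commit_time) for class T_i.
    """
    active = [start for start, commit in transactions if start < m and commit > m]
    return min(active, default=m)


t_i = [(1, 4), (3, 8), (6, math.inf)]
```

With the three transactions above, at time 5 only the transaction started at time 3 is active, so i_old(5, t_i) is 3. At time 9 only the unfinished transaction started at time 6 is active, so the result is 6.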


Definition. Let the activity link function A_i^j be a function defined
for a pair of data segments D_i and D_j, where D_j +> D_i in DSH*. A_i^j
recursively maps a time m to another time as follows.

A_i^j(m) = I_j^old(m)          if D_i -> D_j = CP_i^j,
           A_k^j(A_i^k(m))     otherwise, where
                               D_i -> D_k -> ... -> D_j = CP_i^j.

Intuitively, the purpose of function A_i^j is to map a time m
corresponding to D_i to another time m' corresponding to D_j where
versions in D_j before m' are considered 'safe' for a transaction rooted
in D_i and started at time m to access. As an example, if the critical
path between D_i and D_j is D_i -> D_k -> D_j, then
A_i^j(m) = I_j^old(I_k^old(m)). This is exemplified in Figure 8.
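Unrolled along the critical path, the recursion amounts to applying the I^old function of each segment above D_i in turn. The sketch below illustrates this under the same assumptions as before: transactions are (initiation time, commit time) pairs, the critical path is given as a list starting at D_i and ending at D_j, and all names are hypothetical.

```python
def i_old(m, transactions):
    """Initiation time of the oldest transaction active at time m, else m."""
    active = [start for start, commit in transactions if start < m and commit > m]
    return min(active, default=m)


def activity_link(m, critical_path, classes):
    """A_i^j(m) for the critical path [D_i, ..., D_j], lowest segment first.

    classes: dict segment -> list of (initiation_time, commit_time).
    """
    for segment in critical_path[1:]:  # skip D_i itself
        m = i_old(m, classes[segment])
    return m


classes = {"Dk": [(2, 9)], "Dj": [(1, 3)]}
path = ["Di", "Dk", "Dj"]
```

For a transaction started at time 5, the oldest transaction of D_k active at time 5 started at time 2, and the oldest transaction of D_j active at time 2 started at time 1, so activity_link(5, path, classes) is 1. At time 10 no transaction is active in either segment and the result is 10.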


4.2  CONCURRENCY  CONTROL  ALGORITHM  FOR  UPDATE  TRANSACTIONS 


Based  on  the  definitions  given  above,  we  describe  in  this  section  the 
concurrency  control  algorithm  for  synchronizing  update  transactions 
under  the  hierarchical  timestamping  approach. 

The algorithm is composed of two protocols, one for synchronizing
intra-class accesses and one for synchronizing read-only accesses to
higher data segments. The former is equivalent to the conventional
timestamp algorithm, where both read and write accesses will result in
timestamping the accessed data element. The latter is a protocol which
grants a particular version of the data element to the requesting
transaction and involves no need for timestamping. This 'particular
version' is chosen to be the latest version right before a certain time
ceiling computed from the A function.

For the purpose of concurrency control, we assume that every data
segment is controlled by a segment controller which supervises accesses
to data granules within that segment.

Concurrency control algorithm for update transactions:

For every database access request from an update transaction t ∈ T_i
for a data granule d ∈ D_j, the following protocol is observed:

Protocol H

If i ≠ j, then the segment controller of D_j provides the version d^o of
d such that

TS(d^o) = Max(TS(d^v)) for all v such that
TS(d^v) < A_i^j(I(t)).

(Note that no trace of this access needs to be registered in any form
for the purpose of concurrency control by the segment controller.)

Protocol E

If i = j, then use the basic timestamp ordering protocol <Bernstein80>
or the multi-version timestamp ordering protocol <Reed78>.
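The version selection of Protocol H can be sketched as a binary search over the write timestamps of a granule. The version list layout and the function name are assumptions of this sketch; the ceiling is the value A_i^j(I(t)) computed elsewhere.

```python
from bisect import bisect_left


def protocol_h_read(versions, ceiling):
    """Return the (write_timestamp, value) pair with the largest write timestamp
    strictly below `ceiling`, or None if no such version exists.

    versions: list of (write_timestamp, value), sorted by write_timestamp.
    ceiling: the time A_i^j(I(t)) for the requesting transaction t.
    """
    timestamps = [ts for ts, _ in versions]
    idx = bisect_left(timestamps, ceiling)  # number of versions with ts < ceiling
    # No read timestamp is recorded: the version list is left untouched.
    return versions[idx - 1] if idx else None


versions_of_d = [(1, "a"), (4, "b"), (7, "c")]
```

With a ceiling of 5 the read returns the version written at time 4. With a ceiling of exactly 4 it returns the version written at time 1, because the comparison is strict.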


4.3  AN  EXAMPLE 


As an example, suppose a transaction t rooted in D_i needs to read
data elements x ∈ D_i, w ∈ D_k, y ∈ D_j, and write data element z ∈ D_i,
where D_j +> D_k +> D_i in the data segment hierarchy. An execution of
this transaction using protocols H and E is shown in Figure 9 where the
dots represent versions of data elements and the circled dots represent
versions to be accessed or created by t. In essence, to read x, t is
given the version right before I(t) (the circled version of x in
Figure 9) and I(t) is left as the read timestamp with this version of x.
To read w, t is given the version right before A_i^k(I(t)) (the circled
version of w in Figure 9) but no timestamp is left with this version.
To read y, t is given the version right before A_i^j(I(t)) (again, the
circled version of y in the figure) and no timestamp is left. Finally,
to write z, the latest version of z is found and verified to be a
version earlier than I(t) which does not have a read timestamp greater
than I(t). Then a new version of z is created with a timestamp I(t).

It can be seen that the difference between this execution and that of
one using the conventional timestamp scheme lies in the way w and y
are accessed, which, under the HTS scheme, does not interfere at all
with transactions that are concurrently updating w and y.

[Figure 9. Illustration of Applying Protocol H and Protocol E.]

4.4  PROOF  OF  CORRECTNESS 

To  show  that  the  above  algorithm  is  correct,  one  must  show  that 
serializability  is  enforced.  In  order  to  do  this,  we  follow  the  steps 
listed  below: 

(1) Define a relation called 'topologically follows', denoted as '=>',
between a pair of transactions and prove two properties of the
relation.

(2) Show that these properties of the relation '=>' lead to Theorem
1, which states that if a concurrency control algorithm allows a
transaction t_1 to directly depend on another transaction t_2 only
if t_1 => t_2, then the transaction dependency graph is cycle
free.

(3) Show that the two protocols introduced previously allow a
transaction t_1 to directly depend on a transaction t_2 only if
t_1 => t_2, which establishes that the above algorithm preserves
serializability.

Definition. A relation topologically follows (denoted as =>) is
defined for a pair of transactions t_1, t_2, where t_1 ∈ D_i, t_2 ∈ D_j,
and D_i and D_j are connected by a critical path in a chosen data segment
hierarchy DSH*, i and j not necessarily distinct. We say that t_1
topologically follows t_2 (or t_1 => t_2) iff

(1) If D_i = D_j then I(t_1) > I(t_2).

(2) If D_i +> D_j then I(t_1) ≥ A_j^i(I(t_2)).

(3) If D_j +> D_i then I(t_2) < A_i^j(I(t_1)).
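The three cases can be written as a small predicate. This is a sketch under assumptions: the relative position of the two root segments is passed in as a string, the activity link function is passed in as a callable, and the stand-in link used in the example (subtracting 2 from the time) is purely hypothetical and chosen only so that the mapped time is earlier than the input.

```python
def topologically_follows(i_t1, i_t2, case, link=None):
    """Return True if t_1 => t_2, given the initiation times I(t_1) and I(t_2).

    case: "same" if D_i = D_j, "t1_higher" if D_i +> D_j, "t2_higher" if D_j +> D_i.
    link: the activity link function to apply, A_j^i in case (2), A_i^j in case (3).
    """
    if case == "same":       # (1) I(t_1) > I(t_2)
        return i_t1 > i_t2
    if case == "t1_higher":  # (2) I(t_1) >= A_j^i(I(t_2))
        return i_t1 >= link(i_t2)
    if case == "t2_higher":  # (3) I(t_2) < A_i^j(I(t_1))
        return i_t2 < link(i_t1)
    raise ValueError(f"unknown case: {case}")


def stand_in_link(m):
    """Hypothetical activity link: maps a time to one two units earlier."""
    return m - 2
```

In case (2) a transaction rooted in the higher segment can topologically follow a lower one even though it started earlier. For example, with I(t_1) = 4 and I(t_2) = 5 the stand-in link gives 3, and 4 ≥ 3 holds. The reverse relation, evaluated under case (3), does not hold, in line with the anti-symmetry property.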

Intuitively, => is a relation between transactions based on both the timing of the transactions and the hierarchical levels in the THG of the transaction classes that the transactions belong to. To be more specific, 't_1 => t_2' always means that t_1 is 'later' than t_2. However, this 'later' is based not only on the initiation times of the two transactions involved, but also on the relative levels of the data segments in which t_1 and t_2 are rooted: given a fixed t_2, the lower the level of the data segment in which t_1 is rooted, the later t_1's initiation time has to be in order for t_1 => t_2 to hold. Clearly, => is defined only between transactions that are rooted in data segments that are on a critical path in DSH*, because otherwise the A function is undefined. This relation is exemplified in Figure 10. Two interesting properties concerning the relation => are presented below:

Property 1.1. The relation => is anti-symmetric. (This follows directly from the definition of the relation.)

Figure 10.  Graphical representation of the relation t_1 => t_2: (1) if D_i = D_j, then I(t_1) > I(t_2); (2) if D_i is higher than D_j, then I(t_1) ≥ A_j^i(I(t_2)); (3) if D_j is higher than D_i, then I(t_2) < A_i^j(I(t_1)).


Property 1.2 (The property of transitivity). The relation => is critical-path transitive, i.e., if there exist t_1 ∈ D_i, t_2 ∈ D_k, t_3 ∈ D_j such that t_1 => t_2, t_2 => t_3, and D_i, D_k and D_j are on a critical path in DSH*, then t_1 => t_3.

Proof. To prove this, we first give the following two obvious properties of the function A:

(0.1) If D_j is higher than D_k and D_k is higher than D_i in DSH*, then A_k^j(A_i^k(m)) = A_i^j(m). (This follows directly from the fact that in DSH* there exists one and only one critical path between any pair of data segments.)

(0.2) A_i^j(m) has a non-negative slope (i.e., if m > m' then A_i^j(m) ≥ A_i^j(m'), and if A_i^j(m) > A_i^j(m') then m > m').

We consider the following 5 groups of cases:

(1) D_i = D_k = D_j. By definition of => we have I(t_1) > I(t_2) > I(t_3). Therefore t_1 => t_3.

(2) D_i = D_k ≠ D_j. Two cases are considered:

(2.1) D_i is higher than D_j. Then t_2 => t_3 implies I(t_2) ≥ A_j^i(I(t_3)). t_1 => t_2 implies I(t_1) > I(t_2). Therefore I(t_1) > A_j^i(I(t_3)). Therefore t_1 => t_3.

(2.2) D_j is higher than D_i. Then t_1 => t_2 implies I(t_1) > I(t_2). By Property 0.2 we have A_i^j(I(t_1)) ≥ A_i^j(I(t_2)). t_2 => t_3 implies A_i^j(I(t_2)) > I(t_3). Therefore A_i^j(I(t_1)) > I(t_3). Therefore t_1 => t_3.

(3) D_i ≠ D_k = D_j. Two cases are considered:

(3.1) D_i is higher than D_j. Then t_2 => t_3 implies I(t_2) > I(t_3). By Property 0.2 we have A_j^i(I(t_2)) ≥ A_j^i(I(t_3)). t_1 => t_2 implies I(t_1) ≥ A_j^i(I(t_2)). Therefore I(t_1) ≥ A_j^i(I(t_3)). Therefore t_1 => t_3.

(3.2) D_j is higher than D_i. Then t_2 => t_3 implies I(t_2) > I(t_3). t_1 => t_2 implies A_i^j(I(t_1)) > I(t_2). Therefore A_i^j(I(t_1)) > I(t_3). Therefore t_1 => t_3.

(4) D_i = D_j ≠ D_k. Two cases are considered:

(4.1) D_i is higher than D_k. Then t_1 => t_2 implies I(t_1) ≥ A_k^i(I(t_2)). t_2 => t_3 implies A_k^i(I(t_2)) > I(t_3). Therefore I(t_1) > I(t_3). Therefore t_1 => t_3.

(4.2) D_k is higher than D_i. Then t_1 => t_2 implies A_i^k(I(t_1)) > I(t_2). t_2 => t_3 implies I(t_2) ≥ A_i^k(I(t_3)). Therefore A_i^k(I(t_1)) > A_i^k(I(t_3)). By Property 0.2 we have I(t_1) > I(t_3). Therefore t_1 => t_3.

(5) D_i ≠ D_k ≠ D_j, D_i ≠ D_j. Six cases are considered:

(5.1) D_j is higher than D_k, and D_k is higher than D_i. Then t_1 => t_2 implies A_i^k(I(t_1)) > I(t_2). From Properties 0.1 and 0.2 we have A_i^j(I(t_1)) = A_k^j(A_i^k(I(t_1))) ≥ A_k^j(I(t_2)). Therefore A_i^j(I(t_1)) ≥ A_k^j(I(t_2)). t_2 => t_3 implies A_k^j(I(t_2)) > I(t_3). Therefore A_i^j(I(t_1)) > I(t_3). Therefore t_1 => t_3.

(5.2) D_i is higher than D_k, and D_k is higher than D_j. Then t_2 => t_3 implies I(t_2) ≥ A_j^k(I(t_3)). From Properties 0.1 and 0.2 we have A_k^i(I(t_2)) ≥ A_k^i(A_j^k(I(t_3))) = A_j^i(I(t_3)). t_1 => t_2 implies I(t_1) ≥ A_k^i(I(t_2)). Therefore I(t_1) ≥ A_j^i(I(t_3)). Therefore t_1 => t_3.

(5.3) D_j is higher than D_i, and D_i is higher than D_k. Then t_1 => t_2 implies I(t_1) ≥ A_k^i(I(t_2)). From Properties 0.1 and 0.2 we have A_i^j(I(t_1)) ≥ A_i^j(A_k^i(I(t_2))) = A_k^j(I(t_2)). t_2 => t_3 implies A_k^j(I(t_2)) > I(t_3). Therefore A_i^j(I(t_1)) > I(t_3). Therefore t_1 => t_3.

(5.4) D_k is higher than D_i, and D_i is higher than D_j. Then t_1 => t_2 implies A_i^k(I(t_1)) > I(t_2). t_2 => t_3 implies I(t_2) ≥ A_j^k(I(t_3)). From Properties 0.1 and 0.2 we have A_i^k(I(t_1)) > A_j^k(I(t_3)) = A_i^k(A_j^i(I(t_3))). Therefore I(t_1) > A_j^i(I(t_3)). Therefore t_1 => t_3.

(5.5) D_k is higher than D_j, and D_j is higher than D_i. Then t_1 => t_2 implies A_i^k(I(t_1)) > I(t_2). t_2 => t_3 implies I(t_2) ≥ A_j^k(I(t_3)). Therefore A_i^k(I(t_1)) > A_j^k(I(t_3)). However, A_i^k(I(t_1)) = A_j^k(A_i^j(I(t_1))). Therefore A_i^j(I(t_1)) > I(t_3). Therefore t_1 => t_3.

(5.6) D_i is higher than D_j, and D_j is higher than D_k. Then t_1 => t_2 implies I(t_1) ≥ A_k^i(I(t_2)). But A_k^i(I(t_2)) = A_j^i(A_k^j(I(t_2))). And t_2 => t_3 implies A_k^j(I(t_2)) > I(t_3). Therefore A_k^i(I(t_2)) ≥ A_j^i(I(t_3)). Therefore I(t_1) ≥ A_j^i(I(t_3)). Therefore t_1 => t_3.

In each group, we have permuted the order of levels among the distinct transaction classes to arrive at a total of 13 cases. These cases exhaust all the possible situations that govern t_1, t_2 and t_3, and for every situation transitivity is shown to hold. Therefore we conclude that => is critical-path transitive. Q.E.D.
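As a sanity check on the case analysis, critical-path transitivity and anti-symmetry can be tested exhaustively on a small model. The sketch below is illustrative only: the three segments lie on one chain, and the toy link function (a constant lag per segment, chosen so that Properties 0.1 and 0.2 hold by construction) is an assumption, not the A function of the report.

```python
from itertools import product

LEVEL = {"D1": 0, "D2": 1, "D3": 2}     # assumed chain; 0 is the highest segment
OFFSET = {"D1": 0, "D2": 2, "D3": 5}    # hypothetical lag of each segment


def a_link(lo, hi, m):
    """Toy A_lo^hi: additive offsets make it composable and non-decreasing."""
    return m - (OFFSET[lo] - OFFSET[hi])


def follows(t1, t2):
    """t1 => t2 for transactions given as (segment, initiation time) pairs."""
    (s1, i1), (s2, i2) = t1, t2
    if s1 == s2:
        return i1 > i2
    if LEVEL[s1] < LEVEL[s2]:            # s1 is higher than s2
        return i1 >= a_link(s2, s1, i2)
    return i2 < a_link(s1, s2, i1)       # s2 is higher than s1


def transitivity_counterexamples(times=range(12)):
    """All triples with a => b and b => c but not a => c."""
    txns = list(product(LEVEL, times))
    return [(a, b, c) for a, b, c in product(txns, repeat=3)
            if follows(a, b) and follows(b, c) and not follows(a, c)]


def symmetric_pairs(times=range(12)):
    """All pairs with both a => b and b => a."""
    txns = list(product(LEVEL, times))
    return [(a, b) for a, b in product(txns, repeat=2)
            if follows(a, b) and follows(b, a)]
```

Both functions return empty lists for this model. A brute-force check of this kind does not replace the proof; it only guards against mistakes when the rules are transcribed into an implementation.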


We now define the following synchronization rule and show that if a concurrency control algorithm enforces this rule then the transaction dependency graph is cycle free.

Definition. We say that the partition synchronization rule (abbreviated as PSR) is enforced in a schedule S of a set of transactions T given a data segment hierarchy DSH* if, for any t_1, t_2 ∈ T, t_1 -> t_2 ∈ TG(S) implies that t_1 => t_2.

In other words, a concurrency control algorithm enforces the partition synchronization rule if it allows direct dependencies to occur between transactions t_1 and t_2 only if t_1 => t_2.

We now prove that a concurrency control algorithm that enforces PSR is correct. This involves proving that such an algorithm will only produce schedules whose transaction dependency graph is cycle-free. The following theorem therefore constitutes our proof.

Theorem 1. Let TG(S) be a transaction dependency graph of a schedule S of a set of update transactions T_u run on a database with a non-cyclic partition P. If the schedule S observes the partition synchronization rule with respect to a natural data segment hierarchy DSH*, then TG(S) has no cycles.

Proof. In order to prove Theorem 1, we first give the following two definitions and a lemma about the transaction dependency graph.

Definition. A critical path dependency between two distinct transactions t_1 ∈ D_i and t_2 ∈ D_j, denoted as CD(t_1, t_2), is a cycle-free dependency path from t_1 to t_2 in TG(S) such that D_i and D_j are on a critical path in DSH*, i and j not necessarily distinct.

Definition. A boundary critical path dependency in TG(S) between two transactions t_1 ∈ D_i and t_2 ∈ D_j, where t_1 ≠ t_2, denoted as BCD(t_1, t_2), is a CD(t_1, t_2) such that either or both of the following are true:

(1) There exists t_3 ∈ D_k such that t_1 -> t_3 ∈ CD(t_1, t_2) and D_i, D_j and D_k are not on one critical path;

(2) There exists t_4 ∈ D_l such that t_4 -> t_2 ∈ CD(t_1, t_2) and D_i, D_j and D_l are not on one critical path.

These two concepts are illustrated in Figure 11.

Property. If BCD(t_1, t_4), where t_1 ∈ D_i and t_4 ∈ D_j, then there exist t_2 ∈ D_k and t_3 ∈ D_l, t_2, t_3 not necessarily distinct, such that CD(t_1, t_2) ⊆ CD(t_1, t_4), CD(t_2, t_3) ⊆ CD(t_1, t_4), CD(t_3, t_4) ⊆ CD(t_1, t_4), and D_i, D_j, D_k and D_l are on one critical path in DSH*. (This follows directly from the fact that DSH* is a semi-tree.)


Lemma 1. If there exists a critical path dependency CD(t_1, t_2) in a transaction dependency graph TG(S(T)) where the schedule S enforces the partition synchronization rule, then t_1 => t_2.

Proof. Let ℓ be the length (in number of arcs, i.e., direct dependencies) of a critical path dependency. Then ℓ has a total order and is bounded from below by 1. By way of complete mathematical induction, to prove that if CD(t_1, t_2) then t_1 => t_2, we have to show the following:

(1) If ℓ(CD(t_1, t_2)) = 1 then t_1 => t_2.

(2) If ℓ(CD(t_1, t_2)) = g and if t_a => t_b for all t_a, t_b s.t. there exists CD(t_a, t_b) and ℓ(CD(t_a, t_b)) < g, then t_1 => t_2.

Now we prove the above two statements.

(1) In this case, CD(t_1, t_2) = t_1 -> t_2. Since S enforces the partition synchronization rule, by definition, we have t_1 => t_2.

(2) To prove the second statement, let t_3 ∈ D_k and t_4 ∈ D_l be such that t_1 -> t_3 ∈ CD(t_1, t_2), t_4 -> t_2 ∈ CD(t_1, t_2), and there is a path, denoted as Path(t_3, t_4), from t_3 to t_4 such that Path(t_3, t_4) ⊆ CD(t_1, t_2). Also let t_1 ∈ D_i and t_2 ∈ D_j. Consider the following two cases:

(2.1) If CD(t_1, t_2) is not a BCD, then Path(t_3, t_4) is a CD(t_3, t_4). Since ℓ(CD(t_3, t_4)) < g, therefore t_3 => t_4. And by the definition of CD, D_i, D_j, D_k and D_l must be on one critical path of DSH*. Therefore we have t_1 -> t_3 and t_4 -> t_2, which under the partition synchronization rule give t_1 => t_3 and t_4 => t_2, together with t_3 => t_4. By Property 1.2 (i.e., the property of critical path transitivity) we have t_1 => t_2.

(2.2) If CD(t_1, t_2) is a BCD, then by the property above of a BCD there exist t_5 ∈ D_m and t_6 ∈ D_n such that CD(t_1, t_5) ⊆ CD(t_1, t_2), CD(t_5, t_6) ⊆ CD(t_1, t_2), and CD(t_6, t_2) ⊆ CD(t_1, t_2), where D_m, D_n, D_i and D_j are on one critical path of DSH*. Since ℓ(CD(t_1, t_5)) < g, therefore t_1 => t_5. Similarly, t_6 => t_2 and t_5 => t_6. By Property 1.2 we conclude t_1 => t_2. Q.E.D.

Now we are ready to prove Theorem 1.

Proof of Theorem 1. Suppose there exists a cycle. Then the cycle involves at least two transactions t_1 and t_2 that belong to transaction classes that are on one critical path. This means that there exist CD(t_1, t_2) and CD(t_2, t_1). By the above lemma, CD(t_1, t_2) implies t_1 => t_2 and CD(t_2, t_1) implies t_2 => t_1. However, => is anti-symmetric (by Property 1.1). Therefore t_1 => t_2 and t_2 => t_1 cannot be true at the same time. Therefore there can be no cycle in this transaction dependency graph. Q.E.D.
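Theorem 1 guarantees that TG(S) is acyclic whenever PSR is enforced, so an implementation can use an explicit cycle test as a debugging assertion on recorded schedules. The following is a minimal sketch of such a test; the adjacency-list representation of TG(S) is an assumption made for illustration.

```python
from typing import Dict, Hashable, Iterable

# Assumed representation: graph[t] lists the transactions that t directly
# depends on, i.e. one entry per arc t -> t' of TG(S).
Graph = Dict[Hashable, Iterable[Hashable]]


def has_cycle(graph: Graph) -> bool:
    """Depth-first search with three colours; True if a directed cycle exists."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}
    for targets in graph.values():
        for node in targets:
            colour.setdefault(node, WHITE)   # nodes that only appear as targets

    def visit(node) -> bool:
        colour[node] = GREY                  # node is on the current search path
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY:          # back arc: a cycle is closed
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK                 # fully explored, no cycle through it
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(colour))


serial_like = {"t1": ["t2"], "t2": ["t3"], "t3": []}
conflicting = {"t1": ["t2"], "t2": ["t1"]}
```

Here `has_cycle(serial_like)` is false and `has_cycle(conflicting)` is true. The recursive search is adequate for small graphs; a long dependency chain would call for an iterative version.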

We conclude that if a concurrency control algorithm enforces the Partition Synchronization Rule then it is correct. What is left is to show that Protocols H and E enforce this rule. Therefore we must show that, by employing these protocols, t_1 -> t_2 implies t_1 => t_2. This translates into the following three cases:


(1) If t_1 and t_2 are rooted in the same data segment, the algorithm must allow t_1 to read a version d^v of a data granule d created by t_2, or to create a new version of a data granule d whose latest version d^v was created by t_2, only if t_2 has an initiation time that is less than that of t_1 (i.e., only if TS(d^v) < I(t_1)).

Protocol E of our algorithm satisfies this requirement.

(2) If t_1 is rooted in D_i of a lower level while t_2 is rooted in D_j of a higher level, then the algorithm must allow t_1 to read d^v created by t_2 only if t_2 has an initiation time less than A_i^j(I(t_1)) (i.e., only if TS(d^v) < A_i^j(I(t_1))).

Protocol H of our algorithm satisfies this requirement.

(3) If t_1 is rooted in D_i of a higher level while t_2 is rooted in D_j of a lower level, then the algorithm must allow t_1 to create, at time m, a new version of a data granule whose predecessor d^v has been read by t_2, only if t_1 has an initiation time greater than or equal to A_j^i(I(t_2)).

This, however, is always true because, by the very fact that t_2 by time m has already read d^v, we know that I(t_2) < m, and therefore A_j^i(I(t_2)) ≤ A_j^i(m). Also, because t_1 is active at m (i.e., not committed yet at m), we know that A_j^i(m) ≤ I(t_1). This leads to the conclusion that I(t_1) ≥ A_j^i(I(t_2)).

5.0  SYNCHRONIZING  UPDATE  TRANSACTIONS  UNDER  CYCLIC  PARTITION


In this chapter we extend the algorithm developed in the previous chapter, which handles the special case of a non-cyclic partition, to include the capability of handling cyclic partitions. As mentioned before, the difference between a non-cyclic and a cyclic partition is that in the former the only inter-class accesses are the read-only accesses to higher data segments, while in the latter the inter-class accesses include reads and writes to lower data segments as well. In the theorems proven in this chapter, we will show that accesses to lower data segments in a data segment hierarchy can be treated independently from those that are intra-class or to higher data segments. That is, to deal with cyclic partitions, in which accesses to lower data segments exist, one needs only to add another protocol on top of the protocols already devised in the previous chapter for intra-class and higher-level accesses.

We will first give the definitions of some functions that are needed in describing the protocol. Then we provide a description of the protocol, followed by an example of the use of the protocols and the proof of correctness.


5.1  BASIC  DEFINITIONS 


The following definitions assume that a data segment hierarchy DSH* is given for a cyclic database partition. To compute the timestamps a transaction uses to access data segments lower than the transaction's own root segment, we will now describe the functions C_i^late and B_j^i, which can be considered conceptually the inverse of the functions I_i^old and A_i^j. These functions are illustrated in Figure 12.


Definition. Let C_i^late: m -> m' be a function which maps a time m to another time m', where D_i is a data segment and C_i^late(m) is determined as follows:

    C_i^late(m) = m             if there exists no t ∈ D_i active at time m;
                = Max(C(t))     otherwise, where t ∈ D_i, I(t) < m and C(t) > m.

That is, C_i^late(m) is the latest commit time of all transactions rooted in D_i that started before time m and have not committed by time m.
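The definition translates directly into code once the initiation and commit times of the transactions rooted in D_i are known. The sketch below assumes, purely for illustration, that this history is available as a list of (initiation time, commit time) pairs; during execution the commit times of active transactions are of course not yet known.

```python
from typing import Iterable, Tuple

# Each transaction is an (initiation time, commit time) pair; the pairs
# passed in are assumed to be exactly the transactions rooted in D_i.
Txn = Tuple[float, float]


def c_late(txns: Iterable[Txn], m: float) -> float:
    """Latest commit time among transactions active at m, or m if none is."""
    commits_of_active = [commit for init, commit in txns if init < m < commit]
    return max(commits_of_active) if commits_of_active else m


history = [(1.0, 5.0), (3.0, 9.0), (10.0, 12.0)]
during_activity = c_late(history, 4.0)   # two transactions are active at time 4
during_idle = c_late(history, 9.5)       # nothing is active at time 9.5
```

With this history, the first call yields 9.0 (the later of the two commit times) and the second yields 9.5 (the argument itself).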


[Figure 12: commit time of the transaction that commits the latest among all transactions rooted in D_i active at time m.]

While the A function maps a time in a lower segment to the initiation time of some transaction rooted in a higher segment, the B function maps a time in a higher segment to the commit time of some transaction rooted in a lower segment:


Definition. The backward activity link function, defined for a pair of data segments D_j and D_i where D_j is higher than D_i, and denoted as B_j^i(m), is a function which maps a time value m to another such that

    B_j^i(m) = m                  if D_i = D_j;
             = C_j^late(m)        if D_j - D_i = CP_ji;
             = B_k^i(B_j^k(m))    otherwise, where D_j - ... - D_k - D_i = CP_ji.
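The recursive case unfolds into one application of a C^late function per arc of the critical path, starting at the higher end. The following sketch shows only that folding. Which segments contribute a C^late application is left to the caller through the `stages` argument, since the indices in the definition are hard to read in the source; the dictionary `history` and the helper names are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Sequence, Tuple

Txn = Tuple[float, float]   # (initiation time, commit time)


def c_late(txns: List[Txn], m: float) -> float:
    """Latest commit time among transactions active at m, or m if none is."""
    commits_of_active = [commit for init, commit in txns if init < m < commit]
    return max(commits_of_active) if commits_of_active else m


def backward_link(stages: Sequence[str],
                  history: Dict[str, List[Txn]], m: float) -> float:
    """Apply C^late of each listed segment in turn, first entry first."""
    return reduce(lambda time, seg: c_late(history[seg], time), stages, m)


history = {"D1": [(1.0, 6.0)], "D2": [(4.0, 9.0)]}
same_segment = backward_link([], history, 3.0)             # no arc: identity
one_arc = backward_link(["D1"], history, 3.0)              # a single C^late
two_arcs = backward_link(["D1", "D2"], history, 3.0)       # composition
```

For this toy history the three calls give 3.0, 6.0 and 9.0: each additional arc pushes the timestamp to the commit time of whatever was active in the next segment.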

5.2  THE  SYNCHRONIZATION  PROTOCOL 

Now we introduce the hierarchical timestamp protocol for cyclic partitions.

Hierarchical Timestamp Protocol For Update Transactions:

For every database access request from an update transaction t ∈ D_i for a data granule d ∈ D_j, the following protocol is observed:

Protocol E (for accessing a segment Equal to the root segment)

If D_i = D_j, then

(1) If it is a read request, then grant the latest version before I(t) of d, and leave I(t) as the read timestamp of this version of d if its current read timestamp is smaller than I(t).

(2) If it is a write request, then if the read timestamp of the latest version before I(t) of d is smaller than I(t), then create a new version of d with version number I(t). Otherwise abort t.

Protocol H (for accessing Higher segments)

If D_j is higher than D_i, then grant t access to the latest version before A_i^j(I(t)) of d.

Protocol L (for accessing Lower segments)

If D_i is higher than D_j, then

(1) If it is a read request, then grant the latest version before B_i^j(I(t)) of d, and leave B_i^j(I(t)) as the read timestamp of this version of d if its current read timestamp is smaller than B_i^j(I(t)).

(2) If it is a write request, then if the read timestamp of the latest version before B_i^j(I(t)) of d is smaller than B_i^j(I(t)), then create a new version of d with version number B_i^j(I(t)). Otherwise abort t.
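The three protocols differ only in the timestamp they use and in whether a read leaves a timestamp behind. The sketch below makes that explicit for a single data granule. It is an illustration under stated assumptions: the class and function names are hypothetical, the A and B values are supplied by the caller as plain functions, a new version starts with its own version number as read timestamp, and repeated writes by one transaction are not handled.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class Abort(Exception):
    """Raised when a write request must be rejected."""


@dataclass
class Granule:
    # Parallel lists kept sorted by version number (write timestamp).
    versions: List[float] = field(default_factory=lambda: [0.0])
    values: List[object] = field(default_factory=lambda: [None])
    read_ts: List[float] = field(default_factory=lambda: [0.0])

    def _latest_before(self, ts: float) -> int:
        idx = bisect_left(self.versions, ts) - 1    # last version strictly < ts
        if idx < 0:
            raise KeyError("no version before the given timestamp")
        return idx

    def read(self, ts: float, stamp: bool = True):
        idx = self._latest_before(ts)
        if stamp and self.read_ts[idx] < ts:
            self.read_ts[idx] = ts                  # leave the read timestamp
        return self.values[idx]

    def write(self, ts: float, value) -> None:
        idx = self._latest_before(ts)
        if self.read_ts[idx] >= ts:                 # read timestamp not smaller
            raise Abort(ts)
        self.versions.insert(idx + 1, ts)
        self.values.insert(idx + 1, value)
        self.read_ts.insert(idx + 1, ts)            # assumed initial value


def access(granule: Granule, kind: str, init_time: float, relation: str,
           a_link: Optional[Callable[[float], float]] = None,
           b_link: Optional[Callable[[float], float]] = None,
           value=None):
    """relation places d's segment relative to t's root: equal/higher/lower."""
    if relation == "higher":                        # Protocol H: reads only
        if kind != "read":
            raise ValueError("no writes to higher segments")
        return granule.read(a_link(init_time), stamp=False)
    ts = init_time if relation == "equal" else b_link(init_time)  # E or L
    if kind == "read":
        return granule.read(ts)
    granule.write(ts, value)
    return None
```

Note that the Protocol H branch neither checks nor updates any read timestamp, which is the source of its lower cost, while Protocols E and L share one code path and differ only in the timestamp passed in.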

Several  observations  are  made  of  this  set  of  protocols: 

(1) If the data segment hierarchy DSH* consists of a single data segment, then only Protocol E will apply, and the hierarchical timestamp algorithm is reduced to the conventional MVTS algorithm.

(2) Since no transaction will write a data segment higher than its own root segment, Protocol H needs to cover only read accesses. More importantly, Protocol H is 'cheaper' than either Protocol E or Protocol L, since it does not require timestamping the data element accessed.

(3) Protocol L is essentially the same as Protocol E with the exception that the timestamps used are different from the timestamps of the accessing transactions. Since B_i^j(I(t)) ≥ I(t), Protocol L is the most 'expensive' among the three, in addition to the difficulty discussed in the following paragraph.

There is a difficulty associated with implementing Protocol L. Computing B_i^j(I(t)) as a timestamp to synchronize access requests from a transaction t is a tricky matter, since theoretically B_i^j(I(t)) is not 'computable' until at least after transaction t has committed. This dilemma can be resolved by artificially computing (i.e., 'guessing') the value of the B function and enforcing it at a later time. To do so, an algorithm is designed to compute pseudo-C_i^late values, and another algorithm, which is a recursive application of the first algorithm, is used to compute pseudo-B values. The first algorithm returns the pseudo-C_i^late value by adding a constant a_i > 0, specific to a data segment D_i, to the argument value. In addition, it inserts constraints for the concurrency control mechanism to enforce the validity of this computation at a later time. These two algorithms are as follows:

pseudo_c (D_i, m): Function.

    Insert a pseudo transaction t' ∈ D_i such that I(t') = m and C(t') = m + a_i.

    Insert constraint: Abort all t" ∈ D_i s.t. I(t") < m and C(t") > m + a_i.

    Return (m + a_i).

(Assume D_i - D_i1 - D_i2 - ... - D_j = CP_ij)
pseudo_b (D_i, D_j, m): Function.

    B = pseudo_c (D_i1, pseudo_c (D_i2, ... pseudo_c (D_j, m) ...)).

    Return (B).

If the a_i's are selected in such a way that a large portion of the non-pseudo transactions rooted in D_i can be executed in time less than a_i, the chance of having to abort transactions in D_i in order to enforce the constraint (i.e., step 2 of the algorithm pseudo_c) is relatively small.
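The two functions above can be sketched as follows. This is an illustrative rendering, not the report's implementation: the `Segment` class, the `windows` list that records each pseudo transaction together with its constraint, and the `must_abort` check are hypothetical names, and the caller lists the segments in the order in which pseudo_c is applied (innermost application first).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    a: float                                  # the constant a_i > 0
    # One (m, m + a_i) pair per pseudo_c call: the pseudo transaction's
    # initiation and commit times, which also define the constraint.
    windows: List[Tuple[float, float]] = field(default_factory=list)

    def pseudo_c(self, m: float) -> float:
        """Guess C^late(m) as m + a_i and record the constraint."""
        self.windows.append((m, m + self.a))
        return m + self.a

    def must_abort(self, init: float, now: float) -> bool:
        """True if a transaction begun at `init`, uncommitted at `now`,
        started before some m and has outlived the matching m + a_i."""
        return any(init < m and now > deadline
                   for m, deadline in self.windows)


def pseudo_b(stages: List[Segment], m: float) -> float:
    """Nested pseudo_c calls, applied in the order given."""
    for seg in stages:
        m = seg.pseudo_c(m)
    return m


d_j, d_i1 = Segment(a=2.0), Segment(a=3.0)
guess = pseudo_b([d_j, d_i1], 10.0)           # pseudo_c(D_i1, pseudo_c(D_j, 10))
```

With these constants the guessed timestamp is 15.0, and a transaction rooted in the first segment that began at time 9 and is still running at time 12.5 would have to be aborted. Larger a_i values make such aborts rarer at the price of later timestamps.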

5.3  AN  EXAMPLE 

In this section we illustrate the use of Protocols H, E and L for controlling update transactions through a simple banking example. The scenario is as follows. The database is composed of three types of information about customers: overdraft limits, demand deposit account balances (or simply 'account balances'), and loan account balances. These types of information are organized into three data segments with a data segment hierarchy shown in Figure 13. Also, three types of update transactions are run against the database: overdraft limit updates, deposits or withdrawals against a demand deposit account, and loan payments or new loan approvals. In the interleaved transaction execution to be demonstrated here, a set of four transactions is run concurrently, and the tasks involved in these transactions are also shown in Figure 13, where d1, d2 and d3 are, respectively, the overdraft limit, the account balance and the loan balance of a customer, Mr. Smith. The timing of the interleaved read and write requests of these transactions, as input to the concurrency control algorithm, is shown in Figure 14. Assuming that the current versions of d1, d2 and d3 are all version 0 and all precede the initiation time of the earliest transaction in the set, i.e., I(t1), the control responses that the concurrency control facility would generate for each request are also shown in Figure 14. As shown in the figure, every request is granted at the time the request is issued and, in this example, no blocking or aborts are induced.

It is interesting to note that the same interleaved schedule of read/write requests, if controlled by the two-phase locking algorithm, would result in blocking and a deadlock. Specifically, when t3 issued its 'read d1' request at time 10, d1 was at that time read-locked by t1. Since read requests do not conflict, t3 would be granted access to d1 and an additional read lock, owned by t3, would be imposed on d1. However, when t1 issued the 'write d1' request at time 11, it would be forced to wait, since it could not convert its read lock on d1 to a write lock while d1 was also read-locked by t3. Therefore, t1 could not commit and must wait for t3 to free its read lock on d1. However, when t3 finally committed and freed its read lock on d1 at time 13, d1 had in the meantime also been read-locked by t4 at time 12; therefore t1 would now be forced to wait for t4 to commit. Unfortunately, t4 could not commit at time 15 because it also needed to write d1 and d1 was held read-locked by t1. Therefore, t1 and t4 would be deadlocked at time 15, and one of the two transactions must be backed out and restarted.

Similarly, if this interleaved schedule of read/write requests is given to a conventional MVTS algorithm, some aborts of transactions would result. In particular, t1 would be aborted at time 11 in attempting to write d1 and t2 would be aborted at time 8 in attempting to write d2. Therefore this example also demonstrates a scenario in which the HDD protocols would have performed better, in terms of level of concurrency, than the two-phase locking algorithm and the conventional multi-version timestamping algorithm.


5.4  PROOF  OF  CORRECTNESS 


Given the rules for testing for serializability (i.e., showing that the transaction dependency graph as defined before is non-cyclic if the above protocol is enforced), the following steps are devised to prove correctness of the HTS algorithm:

(1) Show that direct dependencies may occur only between transactions that are rooted in segments that are related in DSH*.

(2) Making use of Theorem 1, proven in the previous chapter, which asserts that if a schedule S enforces the relation 'topologically follows' (i.e., the relation '=>' as defined in the proof of the previous chapter) on all direct transaction dependencies (i.e., t_2 -> t_1 ∈ TG(S) only if t_2 => t_1), then the transaction dependency graph TG(S) has no cycle, show that the HTS algorithm produces only schedules that enforce the relation '=>' on all direct transaction dependencies.

Lemma 2. Given a DSH*, let t_1 ∈ D_i and t_2 ∈ D_j. If t_2 -> t_1, then D_i and D_j are related in DSH*.

Proof. Let d ∈ D_k be the contended data element that causes t_2 -> t_1. Since at least one transaction writes d, we have either D_i -(*) D_k -(=) D_j contained in DSG(P,T_u) or D_j -(=) D_k -(=) D_i contained in DSG(P,T_u). By the definition of data segment hierarchy, D_i and D_j must be related in DSH*.

Given Theorem 1, which states that the relation => can be used as a vehicle for ordering transactions for concurrency control purposes, the following theorem completes our proof of correctness.

Theorem 2. Let S be a schedule that is permitted by the hierarchical timestamp protocol. Then t_2 -> t_1 ∈ TG(S) only if t_2 => t_1 (i.e., the hierarchical timestamp protocol enforces PSR under a cyclic partition).

Proof. We will first present the following 3 properties, which bind the functions A and B together and are used to transform the timing relationship imposed by the protocol to that defined by the relation =>:

Property 2.1. A_i^j(B_j^i(m)) ≥ m, where D_i - D_i1 - D_i2 - ... - D_i(n-1) - D_in - D_j = CP_ij in the data segment hierarchy DSH*.

Proof. A_i^j(B_j^i(m)) = A_i^j(C_i1(...(C_in(C_j(m)))...)). (C_j is an abbreviated expression for C_j^late and I_j is an abbreviated expression for I_j^old.) Let m_j = C_j(m). Then m_j = C(t_j^0) if there exists t ∈ D_j active at time m and C_j(m) = C(t_j^0), and m_j = m if there exists no t ∈ D_j active at time m. Therefore A_i^j(B_j^i(m)) = A_i^j(C_i1(...(C_in(m_j))...)). Continuing the substitution of the C function in the expression with similarly defined m_in, ..., m_i1, we get A_i^j(B_j^i(m)) = A_i^j(m_i1). Now we start spelling out the function A_i^j: A_i^j(m_i1) = A_i1^j(I_i1(m_i1)). Consider the following two cases:

(1) If there exists no t ∈ D_i1 active at m_i1, then I_i1(m_i1) = m_i1. Since m_i1 = C_i1(m_i2) ≥ m_i2, we have I_i1(m_i1) ≥ m_i2.

(2) If there exists t ∈ D_i1 active at m_i1, then I_i1(m_i1) = I(t_i1^1), where t_i1^1 ≠ t_i1^0 (since t_i1^1 is active at m_i1 while t_i1^0 commits at m_i1), and I(t_i1^1) > m_i2 (since if I(t_i1^1) < m_i2 then during the previous application of C_i1, C_i1(m_i2) should be equal to C(t_i1^1) and not C(t_i1^0), which contradicts the assumption). Therefore I_i1(m_i1) ≥ m_i2.

Therefore we conclude I_i1(m_i1) ≥ m_i2. By the same reasoning we continue spelling out the A function to arrive at the following: A_i^j(B_j^i(m)) ≥ A_in^j(I_in(m_in)). Since I_in(m_in) ≥ m_j = C_j(m), we have A_i^j(B_j^i(m)) ≥ A_in^j(C_j(m)). Since A_in^j(C_j(m)) = I_j(C_j(m)) ≥ m, we have A_i^j(B_j^i(m)) ≥ m. Q.E.D.

Property 2.2. A_i^j(B_j^i(m) - ε) < m, where D_i - D_i1 - D_i2 - ... - D_i(n-1) - D_in - D_j = CP_ij in the data segment hierarchy DSH*, and ε is a small value.

Proof. Let m_j, m_in, ..., m_i1 be defined in the same way as in the proof of Property 2.1. We have A_i^j(B_j^i(m) - ε) = A_i^j(m_i1 - ε) = A_i1^j(I_i1(m_i1 - ε)). Now we show that I_i1(m_i1 - ε) < m_i2. Consider the following two cases:

(1) If there exists no t ∈ D_i1 active at m_i2, then m_i1 = C_i1(m_i2) = m_i2. Therefore I_i1(m_i1 - ε) = I_i1(m_i2 - ε) ≤ m_i2 - ε < m_i2.

(2) If there exists t ∈ D_i1 active at m_i2, then m_i1 = C_i1(m_i2) = C(t_i1^0) where I(t_i1^0) < m_i2. Therefore I_i1(m_i1 - ε) = I_i1(C(t_i1^0) - ε) ≤ I(t_i1^0) < m_i2.

Therefore we conclude I_i1(m_i1 - ε) < m_i2. Let m_i1' = I_i1(m_i1 - ε). Then m_i1' < m_i2, and A_i^j(B_j^i(m) - ε) = A_i1^j(m_i1'). Continuing the process of substitution, we have A_i^j(B_j^i(m) - ε) = A_in^j(m_in') = I_j(m_in') where m_in' < m_j. But I_j(m_in') ≤ I_j(m_j - ε) = I_j(C_j(m) - ε) < m. Therefore A_i^j(B_j^i(m) - ε) < m. Q.E.D.

Property 2.3. A_i^j(B_j^i(I(t))) > I(t), where D_j is higher than D_i and t ∈ D_j is a transaction.

Proof. Consider the following two cases:

(1) If D_i -> D_j is a critical arc in DSH*, then A_i^j(B_j^i(I(t))) = I_j(C_j(I(t))). Since all transactions that start at or before I(t) have committed by the time indicated by C_j(I(t)), one concludes that I_j(C_j(I(t))) > I(t).

(2) If D_i -> ... -> D_in -> D_j ⊆ DSH*, then A_i^j(B_j^i(I(t))) = I_j(A_i^in(B_in^i(C_j(I(t))))). By Property 2.1, the above expression is ≥ I_j(C_j(I(t))), which by (1) is greater than I(t). Q.E.D.

Now we are ready to prove Theorem 2. Let t_2 ∈ D_j and t_1 ∈ D_i, and let t_2 -> t_1 be due to a contended data element d ∈ D_k. To show that t_2 -> t_1 implies t_2 => t_1, we must consider all the possible permutations in the hierarchical relationship in DSH* among D_i, D_j and D_k. Since at least one of the transactions t_1 and t_2 must have write access to D_k, the permutations where D_k is higher than both D_i and D_j are ruled out. This leads to four different permutations of D_i, D_j and D_k where D_i, D_j and D_k are distinct. We also must consider cases where D_i, D_j and D_k are not all distinct. This leads to a total of 6 legitimate cases where at least two of the data segments are merged. Therefore there are a total of 10 different relationships among D_i, D_j and D_k that must be examined.

For each case, we must further consider the following three scenarios
that lead to t2 -> t1:

(1) t2 reads a data element d ∈ Dk written (created) by t1.

(2) t2 writes (i.e., creates a new version d^v of) a data element d ∈
Dk whose predecessor d° was read by t1.

(3) t2 writes (creates a new version d^v of) a data element d ∈ Dk
whose predecessor d° was written (created) by t1 and the version
created by t2 is read by another transaction.

Now  we  consider  the  following  10  cases: 

(1) Di ▷ Dj ▷ Dk. In this case, obeying HTS leads to the same
assertion for all three scenarios: B_j^k(I(t2)) > B_i^k(I(t1)).
From this and Properties 2.1 - 2.3 we can deduce the following:
A_j^i(I(t2)) ≥ A_j^i(A_k^j(B_j^k(I(t2)) - ε)) = A_k^i(B_j^k(I(t2)) - ε)
≥ A_k^i(B_i^k(I(t1))) > I(t1). Therefore t2 => t1.

(2) Di ▷ Dj = Dk. In this case, obeying HTS leads to the same
assertion for all three scenarios: I(t2) > B_i^k(I(t1)). From
this and Properties 2.1 - 2.3 we deduce the following:
A_j^i(I(t2)) > A_k^i(B_i^k(I(t1))) > I(t1). Therefore t2 => t1.
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assertion  for  all  three  scenarios:  Bjk(I(t2))  >  B1k(I(t1)). 
From  this  and  the  properties  2.1  -  2.3  we  deduce  the  following: 
I(ta)  >  AkMBjk(I(t2))  -  «))  *  AkJ(B1k(I(t1)))  =  Ajj(Ak’(Bjk(I(t 
i))))  i  A,MHt,)).  Therefore  t2  =>  ti. 

Dj ▷ Di = Dk. In this case, obeying HTS leads to the same
assertion for all three scenarios: B_j^k(I(t2)) > I(t1). From
this and Properties 2.1 - 2.3 we deduce the following:
I(t2) > A_k^j(B_j^k(I(t2)) - ε) ≥ A_k^j(I(t1)) = A_i^j(I(t1)).
Therefore t2 => t1.

Di ▷ Dk ▷ Dj. In this case, the only possible scenario that
applies is scenario (1). Under this scenario, obeying HTS leads
to the following assertion: A_j^k(I(t2)) > B_i^k(I(t1)). From this
and Properties 2.1 - 2.3 we deduce the following:
A_j^i(I(t2)) = A_k^i(A_j^k(I(t2))) ≥ A_k^i(B_i^k(I(t1))) > I(t1).
Therefore t2 => t1.

(8) Di = Dk ▷ Dj. In this case, the only possible scenario that
applies is scenario (1). Under this scenario, obeying HTS leads
to the following assertion: A_j^k(I(t2)) > I(t1). Therefore
t2 => t1.

(9) Dj ▷ Dk ▷ Di. In this case, only scenario (2) applies.
There are two sub-cases:

(9.1) d^v already exists when t1 reads d°. By HTS protocols this means
B_j^k(I(t2)) ≥ A_i^k(I(t1)). However, given that no two timestamps
are identical and B_j^k(I(t2)) is a commit timestamp and
A_i^k(I(t1)) is an initiation timestamp, B_j^k(I(t2)) ≠ A_i^k(I(t1)).
Therefore we have B_j^k(I(t2)) > A_i^k(I(t1)). From this and
Properties 2.1 - 2.3 we can deduce the following:
I(t2) > A_k^j(B_j^k(I(t2)) - ε) ≥ A_k^j(A_i^k(I(t1))) = A_i^j(I(t1)).
Therefore t2 => t1.

(9.2) d^v does not exist when t1 reads d°. Let the time when t2 writes
be denoted as time2 and the time when t1 reads be denoted as
time1. Then we have the following: B_j^k(I(t2)) ≥ C(t2) > time2 >
time1 > I(t1) ≥ A_i^k(I(t1)). Therefore we obtain
B_j^k(I(t2)) > A_i^k(I(t1)). By the same reasoning as in (9.1) we
conclude that t2 => t1.

(10) Dj ▷ Dk = Di. In this case, only scenario (2) applies. There
are also two sub-cases:

(10.1) d^v already exists when t1 reads d°. By HTS protocols this means
B_j^k(I(t2)) ≥ A_i^k(I(t1)). By similar reasoning as above we have
B_j^k(I(t2)) > I(t1). From this and Properties 2.1 - 2.3 we
deduce the following: I(t2) > A_k^j(B_j^k(I(t2)) - ε) ≥ A_k^j(I(t1))
= A_i^j(I(t1)). Therefore t2 => t1.

(10.2) d^v does not exist when t1 reads d°. By similar reasoning as in
(9.2) we have the following: B_j^k(I(t2)) ≥ C(t2) > time2 > time1 >
I(t1) ≥ A_i^k(I(t1)). Therefore we obtain B_j^k(I(t2)) > A_i^k(I(t1)).
By the same reasoning as above we conclude that t2 => t1. Q.E.D.


Theorem 2 concludes that the hierarchical timestamp protocols under
cyclic partitions enforce the partition synchronization rule. This,
combined with the result of Theorem 1, completes the proof of
correctness of the algorithm.
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6.0  SYNCHRONIZING  READ-ONLY  TRANSACTIONS 


What  has  been  discussed  is  the  algorithm  for  controlling  concurrent 
update  transactions.  Now  we  turn  to  the  read-only  transactions. 

For a read-only transaction t that reads from data segments that lie
on one critical path CP_i^j of the DSH*, the protocol that should be
observed is the same as that observed by the update transactions rooted
in a data segment immediately below the lowest data segment on the
critical path CP_i^j in DSH*, namely, the data segment right below Di.
(If there exist no data segments below Di in DSH*, then a fictitious
data segment can be created to 'host' this read-only transaction.)
Therefore such read-only transactions have to obey Protocol H alone and
will not cause any read timestamp or read lock to be generated. This is
graphically presented by transaction t1 in Figure 15.

What we are concerned with here are those read-only transactions that
read from any combination of data segments that do not lie on a critical
path in the data segment hierarchy, DSH*, as illustrated by transaction
t2 in Figure 15. In general, what we are looking for is a way to prove
the existence of and to derive a consistent database state across all
data segments in a database in which the hierarchical decomposition
approach to concurrency control is used. In other words, if a
consistent database state can be derived, then it can be read by
read-only transactions without violating serializability. To achieve
this, we first introduce the extended activity link function in the
following subsection, and prove certain properties of this function
that enable us to derive a consistent database state.


6.1  BASIC  DEFINITIONS  AND  PROOF  OF  PROPERTIES 


In the previous section we have introduced the activity link function
which centers around the linkage between transactions rooted in data
segments that are on a critical path in the data segment hierarchy. The
extended function, on the other hand, specifies how transactions rooted
in a data segment are linked to transactions rooted in another data
segment when there is not necessarily any critical path in DSH* that
connects the two. This function is used to provide a way of computing a
consistent database state that can be accessed by a read-only
transaction that reads from any combination of data segments in the
database.

The extended activity link function is defined by the functions
introduced in the previous chapters, namely, functions A and B. Its
usefulness will be indicated in a lemma that follows. The existence and
derivation of a consistent database state is given in Theorem 3, which
makes use of the extended activity link function.


Definition. An undirected critical path, denoted as UCP_i^j, is an
ordered set of distinct indices of data segments in DSH*, such that
UCP_i^j = <i, i1, i2, ..., in, j> where for any two indices h, k
adjacent in the set, either Dh -> Dk or Dk -> Dh is a critical arc in
DSH*.

It  is  obvious  that  for  any  data  segment  hierarchy  DSH*  there  exists 
one  and  only  one  UCP  in  DSH*  between  any  pair  of  data  segments.  While 
the  activity  link  function  A  is  defined  for  a  pair  of  data  segments  that 
lie  on  a  critical  path,  the  extended  activity  link  function,  using  the 
concept  of  UCP,  is  defined  for  any  pair  of  data  segments. 

Definition. The extended activity link function defined for a pair
of data segments Di and Dj, denoted as E_i^j(m), is a function which
maps a time value m to another such that

E_i^j(m) =
    m                 if i = j,
    C_i^late(m)       if i ≠ j and Dj -> Di is a critical arc in DSH*,
    I_j^old(m)        if i ≠ j and Di -> Dj is a critical arc in DSH*,
    E_k^j(E_i^k(m))   otherwise, where <i,k,...,j> = UCP_i^j.


As an example, suppose Di, Dj, Dk and Dh are in DSH* such that Di ->
Dk, Dh -> Dk and Dh -> Dj, but Di and Dj are not related in DSH*, as
shown in Figure 16. Then UCP_i^j = <i,k,h,j> and E_i^j(m) is evaluated
as follows. Take the first pair of indices <i,k> in UCP_i^j and examine
the relationship between Di and Dk. Since Di -> Dk, I_k^old is invoked
and E_i^j(m) is expanded to be E_k^j(I_k^old(m)) = E_k^j(m'). Then
examine the relationship between the next pair of data segments Dk and
Dh. Since Dh -> Dk, C_k^late is applicable and E_k^j(m') is expanded to
be E_h^j(C_k^late(m')) = E_h^j(m''). Finally, the relationship between
the last pair of data segments on UCP_i^j is examined and since Dh ->
Dj, I_j^old is applicable and E_h^j(m'') is equated to I_j^old(m''). In
sum, E_i^j(m) is expanded to be I_j^old(C_k^late(I_k^old(m))). This
is illustrated in Figure 16.
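
The expansion above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the representation of
critical arcs as (lower, higher) pairs, the names
undirected_critical_path and extended_link, and the caller-supplied
callables i_old and c_late (stand-ins for the I^old and C^late
functions, whose definitions are not repeated in this section) are all
hypothetical and not part of the report.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Set, Tuple, TypeVar

T = TypeVar("T")          # type of a time value (kept generic for the sketch)
Arc = Tuple[str, str]     # (a, b) stands for the critical arc Da -> Db


def undirected_critical_path(arcs: Set[Arc], i: str, j: str) -> List[str]:
    """Return UCP_i^j by breadth-first search over the arcs, ignoring direction."""
    neighbours: Dict[str, List[str]] = {}
    for a, b in arcs:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    parent: Dict[str, Optional[str]] = {i: None}
    queue = deque([i])
    while queue:
        node = queue.popleft()
        if node == j:
            break
        for nxt in neighbours.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if j not in parent:
        raise ValueError(f"no undirected critical path from {i} to {j}")
    path: List[str] = []
    cursor: Optional[str] = j
    while cursor is not None:
        path.append(cursor)
        cursor = parent[cursor]
    return path[::-1]


def extended_link(arcs: Set[Arc],
                  i_old: Callable[[str, T], T],
                  c_late: Callable[[str, T], T],
                  i: str, j: str, m: T) -> T:
    """Evaluate E_i^j(m) by walking UCP_i^j one adjacent pair at a time."""
    path = undirected_critical_path(arcs, i, j)
    for src, dst in zip(path, path[1:]):
        if (src, dst) in arcs:
            m = i_old(dst, m)    # D_src -> D_dst: apply I_dst^old
        else:
            m = c_late(src, m)   # D_dst -> D_src: apply C_src^late
    return m


# Symbolic stand-ins so the expansion itself can be inspected.
arcs = {("i", "k"), ("h", "k"), ("h", "j")}
expansion = extended_link(
    arcs,
    lambda seg, m: f"I_{seg}^old({m})",
    lambda seg, m: f"C_{seg}^late({m})",
    "i", "j", "m",
)
```

With the arcs of the example, expansion evaluates to the string
"I_j^old(C_k^late(I_k^old(m)))", matching the hand expansion. Passing
real time-valued functions for i_old and c_late would yield the time
value instead of its symbolic form.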


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The  usefulness  of  the  extended  activity  link  function  lies  in  the 
fact  that  it  can  be  used  to  compute  the  components  of  a  time  wall. 
Intuitively, a time wall for all data segments in the database system is
a set of times such that no direct dependency from the 'older side' of
the wall to the 'newer side' of the wall can occur. A time wall TW(m,s)
is composed of an ordered set of time values, the i-th component of
which is expressed as E_s^i(m), where m is a time, Ds is a chosen data
segment, and Di is any data segment. The function E has the property
that, given any pair of data segments Di and Dj, if t1 ∈ Di and t1
starts before the i-th component of a time wall, and t2 ∈ Dj and t2
starts after the j-th component of the time wall, then t1 cannot
directly depend on t2. This property is formalized in the following
lemma while the concept of a time wall is graphically presented in
Figure 17.

Lemma 3.1. Let Dk, Di and Dj be data segments in a DSH* of a database
partition, and let Di and Dj be on one critical path in DSH*. Then
for any time value m and t1 ∈ Di, t2 ∈ Dj, if I(t1) < E_k^i(m) and
I(t2) ≥ E_k^j(m) then there exists no t1 -> t2 in the transaction
dependency graph TG(Tu) where the concurrency control algorithm
enforces the partition synchronization rule.

Proof. Let Dk1 be the class such that k1 is the first index in UCP_k^i
where Dk1 and Di, Dj are on one critical path. (k and k1 are not
necessarily distinct.) Then k1 will also be the first such index in
UCP_k^j, and the subset of the ordered set UCP_k^i up to k1 and that of
UCP_k^j up to k1 are equivalent. (This is because between any pair of
data segments there is one and only one UCP.) Consider the following
four groups of cases:

(1) i = j ≠ k1 or i = j = k1. In this case, E_k^i(m) = E_k^j(m), so
I(t1) < I(t2). Since t1 and t2 are in the same class, by the
intra-class synchronization rule this implies that there exists no
t1 -> t2.

(2) i = k1 ≠ j. Two cases are considered:

(2.1) Di ▷ Dj. I(t2) ≥ E_k^j(m) implies that A_j^i(I(t2)) ≥
A_j^i(E_k^j(m)) = A_j^k1(B_k1^j(E_k^k1(m))). From Property 2.1 we have
A_j^k1(B_k1^j(E_k^k1(m))) ≥ E_k^k1(m) = E_k^i(m) > I(t1). Therefore
A_j^i(I(t2)) > I(t1), which implies that there exists no t1 -> t2.

(2.2) Dj ▷ Di. I(t1) < E_k^i(m) implies A_i^j(I(t1)) ≤ A_i^j(E_k^i(m))
= E_k^j(m) ≤ I(t2). Therefore A_i^j(I(t1)) ≤ I(t2), which
implies that there exists no t1 -> t2.

(3) j = k1 ≠ i. Two cases are considered:

(3.1) Di ▷ Dj. I(t2) ≥ E_k^j(m) implies that A_j^i(I(t2)) ≥
A_j^i(E_k^j(m)) = E_k^i(m) > I(t1). Therefore A_j^i(I(t2)) > I(t1),
which implies that there exists no t1 -> t2.

(3.2) Dj ▷ Di. I(t1) < E_k^i(m) implies I(t1) ≤ E_k^i(m) - ε, which
implies A_i^j(I(t1)) ≤ A_i^j(B_j^i(E_k^j(m)) - ε). From Property 2.2 we
have A_i^j(B_j^i(E_k^j(m)) - ε) < E_k^j(m). Since I(t2) ≥ E_k^j(m),
A_i^j(I(t1)) < I(t2), which implies that there exists no
t1 -> t2.

(4) i ≠ j ≠ k1. Six cases are considered:

(4.1) Di ▷ Dk1 ▷ Dj. I(t2) ≥ E_k^j(m) implies that A_j^i(I(t2)) ≥
A_j^i(E_k^j(m)) = A_j^i(B_k1^j(E_k^k1(m))) =
A_k1^i(A_j^k1(B_k1^j(E_k^k1(m)))). By Property 2.1,
A_k1^i(A_j^k1(B_k1^j(E_k^k1(m)))) ≥ A_k1^i(E_k^k1(m)) = E_k^i(m).
Since E_k^i(m) > I(t1), we have A_j^i(I(t2)) > I(t1), which implies
that there exists no t1 -> t2.

(4.2) Dj ▷ Dk1 ▷ Di. I(t1) < E_k^i(m) implies I(t1) ≤ E_k^i(m) - ε,
which implies A_i^j(I(t1)) ≤ A_i^j(E_k^i(m) - ε) =
A_k1^j(A_i^k1(B_k1^i(E_k^k1(m)) - ε)). Let m' =
A_i^k1(B_k1^i(E_k^k1(m)) - ε). By Property 2.2 we have m' < E_k^k1(m).
Therefore A_k1^j(A_i^k1(B_k1^i(E_k^k1(m)) - ε)) = A_k1^j(m') ≤
A_k1^j(E_k^k1(m)) = E_k^j(m) ≤ I(t2). Therefore A_i^j(I(t1)) ≤ I(t2),
which implies that there exists no t1 -> t2.

(4.3) Dk1 ▷ Di ▷ Dj. I(t2) ≥ E_k^j(m) implies A_j^i(I(t2)) ≥
A_j^i(E_k^j(m)) = A_j^i(B_i^j(E_k^i(m))) ≥ E_k^i(m) > I(t1). Therefore
A_j^i(I(t2)) > I(t1), which means that there exists no t1 -> t2.

(4.4) Dj ▷ Di ▷ Dk1. I(t1) < E_k^i(m) implies A_i^j(I(t1)) ≤
A_i^j(E_k^i(m)) = A_i^j(A_k1^i(E_k^k1(m))) = A_k1^j(E_k^k1(m)) =
E_k^j(m) ≤ I(t2). That is, A_i^j(I(t1)) ≤ I(t2), which means there
exists no t1 -> t2.

(4.5) Dk1 ▷ Dj ▷ Di. I(t1) < E_k^i(m) implies I(t1) ≤ E_k^i(m) - ε,
which means A_i^j(I(t1)) ≤ A_i^j(E_k^i(m) - ε) =
A_i^j(B_j^i(E_k^j(m)) - ε) < E_k^j(m) ≤ I(t2). That is,
A_i^j(I(t1)) < I(t2), which means that there exists no t1 -> t2.

(4.6) Di ▷ Dj ▷ Dk1. I(t2) ≥ E_k^j(m) implies A_j^i(I(t2)) ≥
A_j^i(E_k^j(m)) = A_j^i(A_k1^j(E_k^k1(m))) = E_k^i(m) > I(t1). That is,
A_j^i(I(t2)) > I(t1), which means that there exists no t1 -> t2.

For each of the groups above we have permuted the levels of the
distinct classes, and for a total of 11 cases we have shown that it is
impossible to have t1 -> t2. Therefore there exists no t1 -> t2.
Q.E.D.

The significance of this time wall concept is that if a read-only
transaction reads the latest versions of data granules of data segment
Di which are right before the time indicated by the time wall component
E_s^i(m) of a certain time wall TW(m,s), then it is accessing a
consistent database state and will in no way induce cycles into the
transaction dependency graph. This discussion is formally presented in
the following theorem.

Theorem 3. If the schedule S enforces the PSR on Tu, and for every
d ∈ Di that a read-only transaction tR reads, S allows it to read the
version d° such that

    TS(d°) = Max(TS(d^v)) where TS(d^v) < E_s^i(m),

for some time m and some transaction class index s, then TG(S(Tu ∪ tR))
has no cycle.
Proof.  In  order  to  prove  Theorem  3,  we  first  give  the  following 
definitions  and  a  lemma  (Lemma  3.2.) 

Definition. A consistent transaction set with respect to a schedule
S of a set of transactions T, abbreviated as a CS w.r.t. S(T), is a set
of transactions Tcs ⊆ T such that if t ∈ Tcs and if there exists t1 ∈ T
such that t -> ... -> t1 ⊂ TG(S) (i.e., if t depends on t1 in the
transitive closure of ->), then t1 ∈ Tcs.

Property 3.1. (The property of a consistent transaction set.)
Partition Tu into Tu1 and Tu2. Then Tu1 is a consistent transaction set
w.r.t. S(Tu) iff for any two transactions t1, t2 such that t1 ∈ Tu1 and
t2 ∈ Tu2, there exists no t1 -> t2 in the transaction dependency graph
TG(S).

Proof. We want to show that the following two parts are true:

(1) If Tu1 is a CS then there exists no t1 -> t2.

By the definition of a CS, if t1 ∈ Tu1 and t1 -> t2, then t2 must
also be in Tu1, which contradicts the given. Therefore there
exists no t1 -> t2.

(2) If there exists no t1 -> t2 for any t1 ∈ Tu1 and t2 ∈ Tu2, then
Tu1 is a CS.

Tu1 is a CS because no transaction in Tu1 can have a dependency
in the transitive closure on a transaction which is not in Tu1.
Therefore we conclude that this property is true. Q.E.D.
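
As a small illustration of Property 3.1, the following Python sketch
checks a candidate set both ways: directly from the definition (closure
under the transitive dependency relation) and through the property (no
direct arc from Tu1 into Tu2). The function names and the encoding of
the dependency graph as a set of (a, b) pairs meaning a -> b are
assumptions made for this sketch only.

```python
from typing import Set, Tuple

Dep = Tuple[str, str]   # (a, b) stands for a -> b, i.e. a directly depends on b


def reachable(arcs: Set[Dep], start: str) -> Set[str]:
    """All transactions that start depends on in the transitive closure of ->."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for a, b in arcs:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def is_cs_by_definition(tcs: Set[str], arcs: Set[Dep]) -> bool:
    """Definition: everything a member depends on, transitively, is a member."""
    return all(reachable(arcs, t) <= tcs for t in tcs)


def is_cs_by_property(tu1: Set[str], tu2: Set[str], arcs: Set[Dep]) -> bool:
    """Property 3.1: no direct dependency t1 -> t2 with t1 in Tu1 and t2 in Tu2."""
    return not any(a in tu1 and b in tu2 for a, b in arcs)


# t3 depends on t2, which depends on t1.
deps = {("t3", "t2"), ("t2", "t1")}
everyone = {"t1", "t2", "t3"}
good = {"t1", "t2"}     # closed under dependency
bad = {"t1", "t3"}      # t3 depends on t2, which is left out
```

For the sample graph both checks accept the set {t1, t2} and both
reject {t1, t3}, which is the equivalence the property states.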

Definition. Given a time value m and a starting data segment Ds, a
designated consistent transaction set, denoted as Tcs(m,s), is a
consistent transaction set such that for all t ∈ Ds, t ∈ Tcs(m,s) iff
I(t) < m.

Lemma 3.2. Partition Tu into Tu1 and Tu2. Then Tu1 is the designated
consistent transaction set Tcs(m,s) w.r.t. S(Tu), where the schedule S
enforces the PSR, if Tu1 contains, for all i, all and only transactions
t such that I(t) < E_s^i(m) where t ∈ Di.

Proof. Construct a time wall TW(m,s). Then by the previous lemma
(Lemma 3.1) we know that for any j, k, if t1 ∈ Dj and I(t1) < E_s^j(m)
and t2 ∈ Dk and I(t2) ≥ E_s^k(m) then there exists no t1 -> t2.
Therefore by Property 3.1 above we know that Tu1 is a consistent
transaction set if it contains for all i only transactions t such that
I(t) < E_s^i(m) where t ∈ Di. And since E_s^s(m) = m, we have
I(t1) < m if t1 ∈ Ds. Therefore Tu1 must be the designated consistent
transaction set Tcs(m,s). Q.E.D.

Corollary. Given a time value m and a starting transaction class Ts,
there exists a designated consistent transaction set Tcs(m,s).

Theorem  3. 

Proof. Partition Tu into Tu1 and Tu2 such that for all t ∈ Di, for
all i, t ∈ Tu1 iff I(t) < E_s^i(m). Then it is clear that dependencies
induced by tR must be arcs that go from tR to transactions in Tu1, and
arcs from transactions in Tu2 to tR. By Lemma 3.1, there exist no
dependencies from transactions in Tu1 to those in Tu2. Therefore arcs
introduced by tR will not introduce any cycle into the original TG(S).
Since TG(S) has no cycle, TG(S(Tu ∪ tR)) has no cycle. Q.E.D.

In other words, if a read-only transaction reads the latest versions
of data granules of data segment Di which are right before the time
indicated by the time wall component E_s^i(m) of a certain time wall
TW(m,s), then it is accessing a consistent database state and will not
induce cycles into the transaction dependency graph.


6.2  CONCURRENCY  CONTROL  PROTOCOL  FOR  READ-ONLY  TRANSACTIONS 


Making use of Theorem 3, a read-only transaction t that reads from
data segments that do not lie on one critical path in DKG should be
given versions that are the latest before a certain time wall. However,
to compute the time wall the system has to determine the starting data
segment Ds and a starting time value m. While the choice can be
arbitrary, it is theoretically desirable that the following criteria
are met:

(1) E_s^i(m) (for all Di in the DSH*) is computable at I(t), the
initiation time of the read-only transaction.

(2) There exists no m' > m such that E_s^i(m') is computable at I(t)
for all Di in the DSH*.

The first criterion stipulates that m should be small enough so that
every E_s^i(m) is computable at I(t); therefore t potentially does not
have to wait until a later time to access data from a certain segment.
(If some E_s^j(m) is not computable at I(t), t would have to wait till a
later time when it is computable before accessing data from data
segment Dj.) The second criterion strives to achieve reading of the
newest possible database state.

A compromise is struck here in devising our protocol for read-only
transactions. First, to save computation time, a new time wall is
computed by the system at certain intervals and the new time wall is
'released' to all read-only transactions that start before the next
version of the time wall is released by the system. (That is, there is
no need to compute a time wall for every read-only transaction.) In
computing the next version of the time wall, the system can choose
arbitrarily a starting data segment Ds which is on one of the lowest
levels and choose m to be the initiation time of the oldest active
transaction rooted in Ds. If it encounters any C^late function that it
cannot compute, it waits until it becomes computable. Eventually enough
time will elapse such that E_s^i(m) becomes computable for all Di's.
Then a newly constructed time wall is released.

Let the release time of a time wall TW(m,s) be denoted as
RT(TW(m,s)). Now we provide the formal definition of the read-only
transaction synchronization protocol.

Concurrency  Control  Algorithm  for  Read-Only  Transaction 

For  every  database  read  request  from  a  read-only  transaction  t  for  a 
data  granule  d,  the  following  protocol  is  observed: 

Protocol  R 

Let d ∈ Di. The segment controller of Di provides the version d° of d
such that

    TS(d°) = Max(TS(d^v)) for all v such that
    TS(d^v) < E_s^i(m)

where RT(TW(m,s)) = Max(RT(TW)) for all TW such that RT(TW) < I(t).
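
Protocol R amounts to two maximizations: pick the most recently
released time wall that precedes the transaction's initiation, then
pick the newest version below that wall's component for the segment.
The Python sketch below illustrates this under assumptions of its own:
the Version and TimeWall records, the function names, and the
dictionary holding precomputed wall components are hypothetical, and
the computation of the components E_s^i(m) themselves is taken as
given.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Version:
    ts: float     # TS(d^v), the timestamp of this version
    value: Any


@dataclass(frozen=True)
class TimeWall:
    release_time: float            # RT(TW(m,s))
    components: Dict[str, float]   # segment name -> E_s^i(m), assumed precomputed


def pick_time_wall(walls: List[TimeWall], init_time: float) -> Optional[TimeWall]:
    """Latest wall released strictly before the reader's initiation time I(t)."""
    released = [w for w in walls if w.release_time < init_time]
    return max(released, key=lambda w: w.release_time) if released else None


def protocol_r_read(versions: List[Version], segment: str,
                    walls: List[TimeWall], init_time: float) -> Optional[Version]:
    """Newest version of d (in the given segment) older than the wall component."""
    wall = pick_time_wall(walls, init_time)
    if wall is None:
        return None                # no wall has been released yet
    bound = wall.components[segment]
    visible = [v for v in versions if v.ts < bound]
    return max(visible, key=lambda v: v.ts) if visible else None


walls = [
    TimeWall(release_time=10.0, components={"D1": 8.0, "D2": 6.0}),
    TimeWall(release_time=20.0, components={"D1": 18.0, "D2": 15.0}),
]
history = [Version(3.0, "a"), Version(7.0, "b"), Version(9.0, "c")]
chosen = protocol_r_read(history, "D1", walls, init_time=15.0)
```

A reader initiated at time 15 uses the wall released at time 10, whose
D1 component is 8, so chosen is the version with timestamp 7. Note that
the read leaves no timestamp behind on the version it returns.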


7.0  IMPLEMENTATION  OF  THE  HTS  CONCURRENCY  CONTROL  ALGORITHM 



7.1  INTRODUCTION  AND  SUMMARY  OF  RESULTS 


In  this  chapter,  an  implementation  of  the  hierarchical  decomposition 
approach  to  concurrency  control  is  described.  Through  this  description, 
the  practicality  of  the  proposed  algorithm  is  demonstrated.  In  our 
design,  we  strive  to  achieve  the  maximum  parallelism  at  the  system 
level,  breaking  down  the  tasks  to  be  performed  on  the  system-level 
resources  into  as  many  parallel  units  as  possible.  Techniques  described 
here  that  pertain  to  implementation  of  multi-version  databases  are  also 
applicable  to  implementing  the  conventional  multi-version  timestamp 
algorithm.  We  will  also  point  out  the  difference  in  implementation 
between  the  HTS  concurrency  control  algorithm  and  the  conventional 
multi-version  timestamp  algorithm. 

There are two major issues involved in the implementation of the HTS
concurrency control algorithm. The first issue is the maintenance of a
multi-version database. This is not an issue exclusive to the HTS
concurrency control algorithm, but is also shared by the conventional
MVTS algorithm. It includes problems of how the versions of a data
element are to be created, how they are stored and controlled to
facilitate rapid accesses, and how they are destroyed to make space for
future versions (i.e., garbage collection). In resolving this issue, an
implementation is designed which allows the garbage collection process
to operate in parallel with activities that create and access the
multi-version database. In addition, we have identified the operations
to be performed against a multi-version database using the conventional
MVTS algorithm to be of two types: ATOMIC_COMMIT and TIMESTAMPED_READ.
We show that the result of using the HTS timestamp algorithm is to
allow a third type of operation, called the NO_TRACE_READ operation, to
replace a certain number of the TIMESTAMPED_READ operations. We will in
our description of the implementation demonstrate that the
NO_TRACE_READ operation is allowed to proceed without ever having to be
blocked, while the TIMESTAMPED_READ operation still faces the danger of
being blocked due to contention for system-level resources. This result
serves to further substantiate the argument that leaving read
timestamps is a relatively expensive operation, in addition to its
potential of causing more transaction aborts.

The second issue involves the mechanisms of implementing Protocol L
and Protocol H. Recall that in order to use Protocol H to access a data
element in a higher data segment, a read time ceiling, computed by
evaluating an A function, must be available. On the other hand, in
using Protocol L to access a data element in a lower data segment, a
timestamp for accessing that data segment, which is different from the
transaction's own initiation timestamp, is used to synchronize such
accesses, and its use involves certain constraints to be enforced.
Accomplishing these tasks requires maintaining additional information.
How this additional information is created, stored, accessed and
destroyed is the main subject in discussing this second issue. In sum,
an implementation is identified which enables a rapid computation of the
A function and does not require the computation process ever to be
blocked due to concurrent accesses to the required information. This
feature, along with the fact that multiple uses of Protocol H by a
single transaction to access data elements in the same higher data
segment would require the computation of the A function to be performed
only once, makes it possible to implement Protocol H efficiently.

The organization of this chapter is as follows. Section 2 provides
an overview of the tasks to be performed by the HTS concurrency control
mechanism. This overview serves to identify the various operations that
will be applied to the multi-version database and other relevant
information. Section 3 describes the implementation of the
multi-version database and how it is used to support concurrent
application of the three types of database operations: ATOMIC_COMMIT,
TIMESTAMPED_READ and NO_TRACE_READ. It also describes mechanisms for
garbage collection. Section 4 describes the implementation of the
transaction history information that is required to facilitate the
evaluation of the A function and to enforce constraints necessitated by
the pseudo evaluation of the B function.


7.2  OVERVIEW  OF  TASKS  OF  THE  HTS  CONCURRENCY  CONTROL  MECHANISM 


In this overview, we will describe the implementation of the HTS
concurrency control mechanism from three different angles: (1) the
tasks of the mechanism as seen by a transaction (i.e., the interface of
the concurrency control facility to a transaction), (2) the modules of
the concurrency control mechanism, and (3) the data (system-level
resources or database data) to be accessed by these modules. This
description serves to provide a perspective on issues involved in
implementing the algorithm and motivates the detailed description in
the subsequent sections.

7.2.1  INTERFACE  OF  THE  CONCURRENCY  CONTROL  FACILITY  TO  TRANSACTIONS 


From the point of view of a transaction, interactions with the
concurrency control facility take place at three points: initiation,
reading database data, and finishing. This is shown in Figure 18. When
a transaction is initiated, INITIATION must be called to obtain an
initiation timestamp for the transaction. During the execution of the
transaction, whenever the transaction requests to read a data element
in the database, the read request must be handled through a READ call
to the concurrency control facility. However, since every uncommitted
transaction is subject to the possibility of user cancellation and
system abort, when the transaction performs a write to the database
before it is finished, it cannot write directly into the database, but
should write into its own work space. This is a standard technique used
to prevent the system from cascading the effect of the transaction to
other transactions before it is committed. When a transaction has
finished processing, its actual commit or abort is then handled by a
FINISH call to the concurrency control facility, which validates all
the writes this transaction has performed by comparing the
transaction's timestamps with timestamps of data elements in the
database. If the transaction passes the validation phase, it is then
committed, and all the writes will be performed in the database.
Otherwise it is aborted and restarted.
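
The three interaction points can be sketched as a small Python class.
This is a hedged illustration, not the HTS mechanism itself: the class
and method names are hypothetical, the write helper merely stands for
the transaction's private work space, and the validation rule used in
finish is the conventional multi-version timestamp check (reject a
write if the version it would supersede has already been read by a
younger transaction), chosen here as an assumption because this section
does not spell out the HTS validation step.

```python
import itertools
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Version:
    write_ts: int
    value: Any
    read_ts: int = 0   # largest timestamp of any transaction that read this version


@dataclass
class Transaction:
    ts: int                                                  # initiation timestamp
    workspace: Dict[str, Any] = field(default_factory=dict)  # uncommitted writes


class ConcurrencyControl:
    def __init__(self, initial: Dict[str, Any]) -> None:
        self._clock = itertools.count(1)
        # Every data element starts with one version written at time 0.
        self.versions: Dict[str, List[Version]] = {
            key: [Version(0, value)] for key, value in initial.items()
        }

    def initiation(self) -> Transaction:
        """INITIATION: hand out a unique initiation timestamp."""
        return Transaction(ts=next(self._clock))

    def _visible(self, key: str, ts: int) -> Version:
        """Newest version of an existing element written before timestamp ts."""
        older = (v for v in self.versions[key] if v.write_ts < ts)
        return max(older, key=lambda v: v.write_ts)

    def read(self, txn: Transaction, key: str) -> Any:
        """READ: a timestamped read that records the reader on the version."""
        if key in txn.workspace:
            return txn.workspace[key]
        version = self._visible(key, txn.ts)
        version.read_ts = max(version.read_ts, txn.ts)
        return version.value

    def write(self, txn: Transaction, key: str, value: Any) -> None:
        """Writes go to the private work space, never directly to the database."""
        txn.workspace[key] = value

    def finish(self, txn: Transaction) -> bool:
        """FINISH: validate every buffered write, then install all or none."""
        for key in txn.workspace:
            if self._visible(key, txn.ts).read_ts > txn.ts:
                return False       # a younger reader saw the predecessor: abort
        for key, value in txn.workspace.items():
            self.versions[key].append(Version(txn.ts, value))
        return True
```

Because nothing is installed until validation succeeds, an aborted
transaction leaves no trace in the version lists, which is the point of
buffering writes in the work space. The sketch only handles writes to
elements that already exist.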

Note that whether these 'calls' are done through message passing or
subroutine calls is not relevant for the purpose of our current
discussion. We assume that the concurrency control facility is capable
of being executed by multiple processes, and the need for mutual
exclusion when multiple processes are in session will be handled by
accesses to system-level semaphores or locks when it arises. Therefore
whether the facility is executed by the processes that also execute the
transactions or it is executed only by dedicated processes does not
alter the correctness of its execution. The choice would depend
strictly on the nature of the processing environment.

Note  also  that  we  do  not  attempt  to  address  separately  the  issue  of 
crash  recovery.  Crash  recovery  can  be  handled  in  the  same  way  as  it  is 
handled  in  other  concurrency  control  methods.  If  the  system  crashes 
during execution of a transaction, that transaction is restarted when
the system recovers without the database being contaminated. To handle
the problem of system crashes during the commit phase of a transaction,
one may resort to the standard technique of two-phase commit <Gray78,
Chan82b>. In essence, two-phase commit requires that the commit phase
of  a  transaction  be  decomposed  into  two  parts:  pre-commit  and 
post-commit.  During  pre-commit,  the  writes  performed  by  the  transaction 
are  forced  to  a  permanent  storage  (e.g.,  a  log  disk).  If  the  system 
crashes  during  pre-commit,  the  transaction  is  considered  uncommitted  and 
therefore  restarted  when  the  system  recovers.  After  pre-commit  is  com¬ 
pleted,  the  transaction  is  considered  committed  and  undergoes  the 
post-commit  phase  during  which  its  writes  are  actually  performed  in  the 
database.  If  the  system  crashes  during  the  post-commit  phase  of  a 
transaction,  the  transaction  is  recovered  when  the  system  recovers  by 
performing  writes  (i.e.,  redo)  into  the  database  based  on  the  log. 
Since implementation of the HTS concurrency control algorithm does not
preclude the use of two-phase commit (i.e., pre-commit can be easily
integrated into the validation phase), crash recovery is not discussed
separately.
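
The pre-commit and post-commit split described above can be sketched as follows. This is a minimal in-memory illustration, not the implementation proposed in this report: the class name `TwoPhaseStore`, its method names, and the use of a Python list and dictionary in place of a log disk and a database are all assumptions made for the example.

```python
class TwoPhaseStore:
    """Toy store: `log` stands in for permanent storage, `database` for the DB."""

    def __init__(self):
        self.database = {}   # data id -> value
        self.log = []        # pre-committed write sets, in commit order
        self.applied = 0     # number of log records already applied

    def pre_commit(self, writes):
        # Force the transaction's writes to the log. Once this returns,
        # the transaction is considered committed.
        self.log.append(dict(writes))

    def post_commit(self):
        # Perform the logged writes in the database.
        while self.applied < len(self.log):
            self.database.update(self.log[self.applied])
            self.applied += 1

    def recover(self):
        # After a crash, redo every logged write set in order. Redo is
        # idempotent here because each record overwrites the values it names.
        self.applied = 0
        self.post_commit()
```

A crash during post-commit leaves the database partially updated; calling `recover()` replays the log and brings it to the committed state.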

7.2.2  MODULES  OF  THE  CONCURRENCY  CONTROL  FACILITY 


We will now provide an overview of the modules to be executed within
each  of  the  three  different  tasks  of  the  concurrency  control  facility. 

7.2.2.1  INITIATION



As  shown  in  Figure  19,  INITIATION  is  fairly  simple.  It  merely  reads 
the  timer  and  returns  a  value  as  the  assigned  timestamp  for  the  trans¬ 
action.  It  is  assumed  that  no  two  readings  of  the  timer  will  result  in 
the  same  time  value.  In  addition,  INITIATION  must  also  record  initi¬ 
ation  of  new  transactions  in  a  transaction  history  table,  which  will  be 
used  by  modules  that  evaluate  the  A  function. 
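
A minimal sketch of INITIATION under these assumptions is given below. The class name `Initiation`, the dictionary layout of the history table, and the use of a monotonically increasing counter in place of a hardware timer are hypothetical choices for illustration. The counter makes the "no two readings are equal" assumption hold by construction.

```python
import itertools
import threading


class Initiation:
    def __init__(self):
        self._timer = itertools.count(1)   # stand-in for the system timer
        self._mutex = threading.Lock()     # serializes timer reads and table updates
        self.history = {}                  # txn id -> {"init": ts, "commit": ts or None}

    def initiate(self, txn_id):
        with self._mutex:
            ts = next(self._timer)         # never returns the same value twice
            self.history[txn_id] = {"init": ts, "commit": None}
        return ts
```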

7.2.2.2  READ

In the conventional MVTS algorithm, the READ call to the CC (concurrency
control) facility will result in a TIMESTAMPED_READ operation.
This operation is passed the timestamp of the transaction and an
identifier  of  the  data  element  to  be  read  (e.g.,  a  logical  page  id),  and 
is  expected  to  return  the  desired  version,  or  the  (virtual)  address  of 
the  physical  page  that  contains  the  desired  version.  TIMESTAMPED_READ 
performs  the  following  tasks: 

(1)  Decide  which  version  of  the  data  element  is  right  before  the 
timestamp  of  the  transaction; 

(2)  Leave  the  timestamp  of  the  transaction  with  this  version; 

(3)  Return  (the  address  of)  this  desired  version. 

In  the  HTS  timestamp  algorithm,  READ  of  the  CC  facility  will  first 
have  to  determine  whether  the  request  is  a  read  to  the  root  data  segment 
of  the  transaction,  to  a  higher  data  segment,  or  to  a  lower  data 
segment.  As  shown  in  Figure  20,  three  different  modules  are  defined  to 
handle these three cases. READ_ROOT implements the read access of Protocol
E, which simply invokes TIMESTAMPED_READ with the transaction
timestamp and the data id. READ_HIGH implements the read access part of
Protocol H. READ_HIGH contains a sub-module, COMPUTE_RTC, which evaluates
the A function given the timestamp of the transaction and the identifiers
of the root data segment and the higher data segment being
accessed. (RTC stands for read time ceiling.) Once the result of the
evaluation is available, READ_HIGH invokes the NO_TRACE_READ operation,
passing to it the result of the evaluation as the read time ceiling, and
the data id. NO_TRACE_READ performs the following tasks:

(1)  Decide which version of the data element is right before the read
time ceiling;

(2)  Return  (the  address  of)  the  desired  version. 

Note that the difference between the TIMESTAMPED_READ operation and
the  NO_TRACE_READ  operation  is  that  the  latter  does  not  have  to  leave  a 
read  timestamp  with  the  version  of  the  data  element  being  accessed. 
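
The contrast between the two operations can be sketched in a few lines. The `Version` class, the newest-first list used as a version chain, and the strict comparison used to decide what "right before" means are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Version:
    vts: int        # timestamp of the transaction that created the version
    value: str      # stands in for (the address of) the version
    rts: int = 0    # largest timestamp left behind by a reader


def _latest_before(chain, ceiling):
    # `chain` is ordered newest first, so the first match is the latest one.
    for version in chain:
        if version.vts < ceiling:
            return version
    raise LookupError("no version precedes the given ceiling")


def timestamped_read(chain, ts):
    version = _latest_before(chain, ts)
    version.rts = max(version.rts, ts)   # leave the timestamp with the version
    return version.value


def no_trace_read(chain, read_time_ceiling):
    # Same lookup, but nothing is written back to the version.
    return _latest_before(chain, read_time_ceiling).value
```

Because `no_trace_read` never writes to a version, it only reads shared data, which is what later allows it to run without locks.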

The  third  module,  READ_LOW,  handles  the  read  access  part  of  Protocol 
L.  It  contains  a  sub-module  REQUEST_LTS  which  provides  timestamps  that 
accesses  to  lower  data  segments  should  use  for  synchronization.  (LTS 
stands  for  lower  timestamp.)  This  sub-module  essentially  'guesses'  the 
value  of  the  B  function  without  actually  evaluating  it,  and  maintains 
constraints  to  be  enforced  in  order  to  validate  its  guesses.  (Refer  to 
section  5.2  for  details.)  The  value  provided  by  REQUEST_LTS  will  be 
used as the timestamp value to be passed to TIMESTAMPED_READ, an operation
also shared by the READ_ROOT module.

As  also  shown  in  Figure  20,  three  types  of  data  are  being  maintained 
and  accessed  by  the  modules  implementing  READ.  The  transaction  history 
table  is  read  by  COMPUTE_RTC  to  evaluate  the  A  function.  The 
pseudo-transaction  table  is  updated  by  REQUEST_LTS  to  record  constraints 
to  be  enforced  as  a  result  of  guessing  at  the  values  of  the  B  function. 
This  table  is  also  read  by  COMPUTE_RTC.  Finally,  the  multi-version 
database  is  consulted  by  both  TIMESTAMPED_READ  and  NO_TRACE_READ  to 
determine  the  correct  version  to  be  read  and  the  virtual  address  of  that 
version. 

7.2.2.3  FINISH

The  tasks  that  the  HTS  algorithm  performs  when  a  transaction  is  fin¬ 
ished  processing  are  very  similar  to  those  performed  by  the  conventional 
MVTS  algorithm.  As  shown  in  Figure  21,  FINISH  in  essence  implements  the 
operation ATOMIC_COMMIT. ATOMIC_COMMIT consists of two stages. At the
first stage, the module VALIDATION is invoked which, for every data element
that the transaction has written, checks to see if the version to
be created by this transaction has been invalidated by a read timestamp
on the version immediately previous to the version to be created. This
validation process will result in the transaction being considered
either aborted or committed. In the former case, the second stage of
ATOMIC_COMMIT will invoke the module ABORT to cause the writes performed
by the transaction to be discarded and the transaction restarted. In
the latter case, it will invoke COMMIT to cause new versions to be actually
created in the multi-version database. ATOMIC_COMMIT is an operation
that has to be accomplished atomically, i.e., it cannot allow
writes that have been validated to be invalidated before the operation
is finished.

An  additional  module  executed  by  FINISH  which  is  unique  to  the  HTS 
algorithm  is  the  COMMIT_TIMESTAMPING  module.  This  module  assigns  a  com¬ 
mit  timestamp  to  the  committed  transaction  and  records  this  information 
in  the  transaction  history  table. 

As  also  shown  in  Figure  21,  the  three  kinds  of  data  accessed  by  READ 
are  also  used  in  FINISH.  Transaction  history  data  is  updated  by 
COMMIT_TIMESTAMPING module to record commit times of transactions.
Pseudo-transaction table is consulted by VALIDATION at the end of the

7.2.3  SHARED  DATA  IN  THE  CONCURRENCY  CONTROL  FACILITY


Many  modules  of  the  above  three  components  of  the  concurrency  control 
facility,  INITIATION,  READ  and  FINISH,  share  accesses  to  the  three  types 
of  data  relevant  to  the  concurrency  control  facility.  In  the  current 
sub-section,  we  will  analyze  potential  concurrent  accesses  to  the  shared 
data  resources.  This  analysis  is  essential  to  designing  the  data  struc¬ 
tures  and  the  access  procedures  to  these  data  structures  so  as  to  allow 
for  the  maximum  level  of  concurrency  within  the  concurrency  control 
facility. 

From  a  theoretical  point  of  view,  the  three  tasks  of  the  concurrency 
control  algorithm,  INITIATION,  READ,  and  FINISH  (i.e.,  writes)  are  all 
atomic tasks. By an atomic task we mean that the correctness of the
algorithm relies on the expectation that the (sub-)process that executes
the task is not interleaved with any other process. However, since each
of these tasks in reality will involve many instructions and execution
steps that are potentially lengthy, enforcing atomicity of these tasks
by allowing the concurrency control facility to execute one task at a
time would inevitably cause the facility to become a bottleneck. Therefore
it is important that these tasks be analyzed and their semantics be
understood so that a design can be achieved that allows as many concurrent
tasks to proceed as possible. This concept of emulating the effect
of atomic execution of tasks while allowing multiple tasks to proceed at
the same time is analogous to emulating the effect of atomic execution
of transactions while allowing multiple transactions to proceed at the
same time. The latter is achieved by the concurrency control facility
of a DBMS, while the former is to be achieved through a careful design
of the facility itself.

7.2.3.1  DESIGNING SYSTEM MODULES TO MAXIMIZE CONCURRENCY - A METHODOLOGY

The  methodology  that  we  use  to  achieve  a  design  of  the  data  struc¬ 
tures  and  access  procedures  that  allows  the  maximum  level  of  concurrency 
within  the  concurrency  control  facility  is  as  follows. 

(1)  Obtain  a  list  of  the  types  of  atomic  tasks  that  will  have  to  be 
performed against shared data. For each of these types of tasks,
analyze the semantics of the tasks and attempt to
break the tasks down into work units that must, based on the
semantics of the tasks, be executed atomically. These work
units, referred to as atomic work units, may be organized in a
hierarchical  fashion,  in  the  sense  that  one  atomic  work  unit  may 
further  contain  other  atomic  work  units  as  atomic  sub-units. 
The  criterion  for  determining  whether  a  work  unit  is  atomic  is 
whether  the  resources  read  or  written  by  the  work  unit  can  be 
released  when  the  unit  is  finished.  Therefore  if  an  atomic  unit 
contains  other  atomic  work  units  as  sub-units,  then  the 
resources  obtained  and  used  by  the  atomic  sub-units  may  be 
released  when  the  sub-units  are  finished,  while  the  resources 
obtained  and  used  by  the  parent  atomic  unit  can  not  be  released 
until  the  parent  unit  is  finished. 


(2)  Construct  a  data-operation  matrix  where  on  one  dimension  the 
shared  data  sets  are  listed  and  on  the  other  dimension  the  atom¬ 
ic  work  units  are  listed.  For  each  cell  in  the  matrix,  the 
operation  that  the  atomic  unit  will  perform  on  the  data  set  is 
listed. The operations could be 'read', 'write' or 'read and
write'. The matrix is used to facilitate conflict analysis.

(3)  Conduct  conflict  analysis  for  each  pair  of  atomic  work  units  to 
identify  potential  conflicting  atomic  work  units.  Two  atomic 
work  units  conflict  if  the  intersection  of  the  data  sets  they 
access  is  not  empty  and  at  least  one  of  the  units  would  have  to 
perform  a  write  on  one  of  the  data  sets  in  the  intersection  set. 
The  result  of  conflict  analysis  is  captured  in  a  conflict  matrix 
where  each  of  the  dimensions  consists  of  all  the  atomic  work 
units. 

(4)  For  each  pair  of  conflicting  atomic  work  units,  examine  the 
nature  of  conflict  and  identify  implementation  methods  that  will 
handle  the  conflict  efficiently. 
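
Steps (2) and (3) are mechanical once the data-operation matrix is written down, as the following sketch suggests. The matrix below is a reduced reading of the module descriptions in this chapter and is not a transcription of Figure 23. The data set names (`history`, `pseudo`, `mvdb`) and the function names are hypothetical.

```python
from itertools import combinations_with_replacement

# Data-operation matrix: atomic work unit -> {data set: "r", "w" or "rw"}.
# Illustrative only; the entries are an assumption based on the prose.
DATA_OPS = {
    "INITIATION":          {"history": "w"},
    "COMMIT_TIMESTAMPING": {"history": "w"},
    "COMPUTE_RTC":         {"history": "r", "pseudo": "r"},
    "REQUEST_LTS":         {"pseudo": "rw"},
    "TIMESTAMPED_READ":    {"mvdb": "rw"},
    "NO_TRACE_READ":       {"mvdb": "r"},
    "ATOMIC_COMMIT":       {"mvdb": "rw"},
}


def conflict(unit_a, unit_b, ops=DATA_OPS):
    # Two units conflict if they share a data set and at least one of them
    # writes to a data set in the intersection.
    shared = ops[unit_a].keys() & ops[unit_b].keys()
    return any("w" in ops[unit_a][d] or "w" in ops[unit_b][d] for d in shared)


def conflict_matrix(ops=DATA_OPS):
    # One cell per unordered pair, including a unit paired with itself
    # (two concurrent instances of the same unit type).
    return {
        (a, b): conflict(a, b, ops)
        for a, b in combinations_with_replacement(sorted(ops), 2)
    }
```

Under this reduced matrix, two NO_TRACE_READ instances never conflict, while NO_TRACE_READ conflicts with both ATOMIC_COMMIT and TIMESTAMPED_READ only in the read-write sense.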

Three methods for handling conflicts are identified and listed below:

(1)  Serialization: Allow only one of the conflicting pair of atomic
work  units  to  proceed  at  any  time.  This  can  be  achieved  through 
the  use  of  a  single  dedicated  process  to  handle  both  types  of 
atomic  work  units.  This  method  is  effective  when  the  atomic 
work  units  involved  are  very  short. 

(2)  Mutual exclusion on selected granules of data sets: Allow multiple
work units to proceed at the same time but serialize those
that  are  contending  for  the  same  data  granules.  This  is 
achieved  by  associating  with  each  granule  of  data  in  the  data 
set  a  semaphore,  or  a  lock.  The  lock  is  obtained  when  the  data 
granule  is  to  be  operated  on  and  is  released  when  the  atomic 
unit  is  finished.  This  method  is  more  expensive  than  serializa¬ 
tion  because  it  involves  overhead  of  setting,  releasing,  testing 
for  locks,  and  blocking.  However,  it  allows  a  higher  level  of 
concurrency. 

(3)  Semantic  analysis  for  read-write  conflict:  For  those  conflicts 
in  which  one  atomic  unit  only  needs  to  read  the  data  set,  seman¬ 
tic  analysis  may  be  combined  with  careful  implementation  to 
enforce  that  no  writes  onto  the  granules  will  (semantically) 
invalidate  a  previous  read  of  the  granules  by  the  read-only 
atomic  unit  still  in  progress.  This  method,  if  applicable, 
eliminates  the  conflict  without  incurring  the  overhead  of  block¬ 
ing  and  locking. 

7.2.3.2  APPLYING THE METHODOLOGY TO DESIGNING THE CONCURRENCY CONTROL
FACILITY

We  will  apply  the  above  methodology  to  the  implementation  of  the  HTS 
control  algorithm.  The  atomic  work  unit  hierarchy,  data-operation 
matrix  and  the  conflict  matrix  are  shown  in  Figure  22,  Figure  23(a)  and 
Figure 23(b). Analyzing the conflict matrix leads to the following
design  decisions: 

(1)  We would like multiple NO_TRACE_READ's to proceed without
ever having to be blocked or to incur locking overhead. Since
its  conflict  with  ATOMIC_COMMIT  and  TIMESTAMPED_READ  is  of  the 
read-write  type,  method  (3)  listed  above  is  applicable.  We  will 
identify  a  design  of  the  multi-version  data  base  and  a  procedure 
for  accessing  this  data  set  by  all  three  types  of  atomic  work 
units  that  would  achieve  this  goal. 

(2)  We would like multiple COMPUTE_RTC's to proceed without
ever having to be blocked or to incur locking overhead. Since
its conflict with REQUEST_LTS, INITIATION and
COMMIT_TIMESTAMPING is of the read-write type, method (3) listed
above  is  again  applicable.  A  design  of  the  transaction  history 
table  and  the  pseudo-transaction  table  and  access  procedures  to 
these  data  sets  is  identified  to  achieve  this  goal. 

(3)  Since it is believed that INITIATION and COMMIT_TIMESTAMPING are
short atomic units, we will use method (1) above to reconcile
their  conflicts. 

(4)  All  the  other  conflicts  are  to  be  resolved  through  system-level 
locks.  These  locks  will  be  integrated  in  the  design  of  these 
data  sets. 

7.2.4  SUMMARY 

In this section we have provided an overview of the tasks of the concurrency
control facility that implement the HTS algorithm. The overview
focuses on the issue of how to design the implementation in a way
so  as  to  maximize  concurrency  within  the  facility.  A  methodology  for 
analyzing  the  tasks  of  the  facility,  which  is  applicable  to  the  imple¬ 
mentation  of  other  facilities  in  a  DBMS  or  an  operating  system,  is 
presented and applied to the current problem. Conclusions are drawn to
guide the designs presented in the subsequent sections.

7.3  IMPLEMENTATION  OF  THE  MULTI-VERSION  DATABASE 

7.3.1  BASIC CONCEPTS

In  this  section  we  describe  the  techniques  for  implementing  the 
multi-version database to be accessed by the three major atomic work
units ATOMIC_COMMIT, TIMESTAMPED_READ and NO_TRACE_READ of the concurrency
control facility. (Refer to the conflict matrix shown in Figure 23
on page 141.)

We assume that all the data elements in the database that processes
outside of the concurrency control facility may read and write, and that
are targets for control by the CC facility, are
fixed-length logical pages. This assumption, however, does not compromise
the generality of the results in our current design. In fact, any
unit of data objects, such as tuples or records of relations or files,
can be used as the units of data elements on whose behalf multiple versions
are to be kept, so long as their identities can be represented and
mapped easily onto a storage unit. (Examples of storage units are virtual
addresses and physical pages.)

A  logical  page  is  identified  by  a  unique  logical  page  id.  Given  a 
database  system,  a  fixed  number  of  logical  pages  are  allocated.  All 
these  pages  are  assumed  to  contain  some  information  which  may  be  user 
data,  system  data  or  simply  information  that  indicates  that  the  page  is 
available  for  storing  new  information.  The  semantics  of  the  content  of 
the  logical  pages  are  of  no  concern  to  the  concurrency  control  facility. 

The  concurrency  control  facility  maintains  multiple  versions  of  these 
logical  pages.  However,  the  fact  that  multiple  versions  exist  for  each 
logical  page  is  transparent  to  all  processes  outside  of  the  concurrency 
control  facility.  Within  the  CC  facility,  versions  of  logical  pages  are 
assigned  to  physical  pages,  each  of  which  corresponds  to  a  virtual  memo¬ 
ry  page  and  is  identified  by  the  virtual  memory  address. 

7.3.2  STRATEGIES  FOR  IMPLEMENTING  MULTIPLE  VERSIONS 

The  analysis  presented  in  the  previous  section  shows  that  within  the 
CC  facility  two  major  types  of  accesses  to  the  MVDB  are  the  following: 

(1)  Given a logical page id, find the physical page that contains the
most recent version of the logical page.

(2)  Given  a  logical  page  id  and  a  time  ceiling,  find  the  physical 
page  that  contains  the  most  recent  version  of  the  logical  page 
prior  to  the  given  time  ceiling. 

The  MVDB  must  be  implemented  to  facilitate  fast  responses  to  these 
types  of  accesses.  To  accomplish  the  first,  we  will  allow  direct 
accesses  to  the  most  recent  version  of  a  logical  page.  To  accomplish 
the second, we will chain the versions of a logical page
together in reverse chronological order so that a request for a version
subject to some time ceiling can be satisfied through a direct
access to the most recent version and then traversing the chain until a
version with a version timestamp smaller than the given time ceiling is
found. Therefore the guidelines for the implementation are the following:

(1)  Direct  access  to  the  most  recent  version  of  a  logical  page. 

(2)  Versions of a logical page are to be chained in reverse chronological
order.

There  are  two  strategies  for  implementing  direct  accesses  to  the  most 
recent versions. The first one is physical clustering, namely, reserving
a contiguous block of physical pages to store the most recent versions
of all the logical pages so that given a logical page id the
virtual address of the physical page that stores the most recent version
of that logical page can be directly computed from that logical page id.
The  second  one  is  through  the  use  of  a  scatter  table.  Given  a  logical 
page  id,  the  address  of  the  entry  that  corresponds  to  the  logical  page 
in  the  scatter  table  is  directly  computable  from  the  logical  page  id, 
while  the  virtual  address  of  the  physical  page  that  contains  that  ver¬ 
sion  is  stored  in  the  scatter  table  entry.  The  physical  clustering 
method  is  shown  in  Figure  24,  and  the  scatter  table  method  is  shown  in 
Figure  25. 
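
The difference between the two strategies comes down to what is computed and what is looked up. The sketch below assumes fixed page and entry sizes and models the scatter table as a list of dictionaries. The constants and function names are hypothetical and chosen only for illustration.

```python
PAGE_SIZE = 4096    # bytes per physical page (assumed)
ENTRY_SIZE = 32     # bytes per scatter table entry (assumed)


def clustered_address(cluster_base, page_id):
    # Physical clustering: the virtual address of the most recent version
    # is computed directly from the logical page id.
    return cluster_base + page_id * PAGE_SIZE


def scatter_entry_address(table_base, page_id):
    # Scatter table: only the address of the table entry is computed ...
    return table_base + page_id * ENTRY_SIZE


def scatter_lookup(table, page_id):
    # ... and the virtual address of the page is read from that entry.
    return table[page_id]["va"]
```

With the scatter table, committing a new version only requires changing the stored `va`, so the page itself can stay where the transaction first wrote it.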


The  advantages  and  disadvantages  of  using  the  scatter  table  method 
versus  the  physical  clustering  method  are  listed  below: 

Advantages:

(1)  Avoid the need of copying versions multiple times: The need for
atomic commit does not allow new versions of logical pages created
by a transaction (i.e., writes on these logical pages by
that transaction) to be made known to the database until it is
decided that the transaction will commit. This amounts to
requiring that new versions be first created in a region of
physical pages assigned to be the transaction's own work space.
Under this circumstance, if physical clustering is used to
implement the most recent versions, then most new versions
will have to be written twice every time they are created:
once in the private work space of the transaction that creates
them and once in the most recent versions' cluster region. Moreover,
when a version is superseded by a more recent version, it
has to be forced out of the most recent versions' cluster region
and be copied into an area for older versions. However, if the
scatter table method is used, the physical pages in the private
work space where the new versions were first created can be
directly pointed to by the scatter table when the transaction
commits, thereby avoiding the overhead of writing the new versions
twice. A similar argument applies when a version is superseded
and must be forced out of the most recent versions'
region.

(2)  Higher  level  of  spatial  locality  when  traversing  the  version 
chain: Since scatter table entries are likely to be much smaller
in size than the logical or physical pages, less space (i.e.,
fewer physical pages) is required to hold the paths
of version chains. Traversing a version chain to find a desired
version can be accomplished by traversing scatter table entries
rather than by actually visiting individual physical pages that
store the intermediate versions of the logical page. Spatial
locality is therefore likely to be enhanced during
version chain traversals. (Higher spatial locality is desirable
because it potentially enhances the performance of the virtual
storage system.)

Disadvantages: 

(1)  Avoid the overhead of maintaining and backing up the scatter

(2)  Higher level of spatial locality in the most recent versions'
region: Since the most recent versions are more likely to be
the desired versions that satisfy most access requests, and
assuming  that  spatial  locality  exists  at  the  logical  page  level 
(i.e.,  contiguous  logical  pages  have  a  higher  probability  of 
being  accessed  together,)  it  is  expected  that  spatial  locality 
at  the  physical  page  level  can  be  enhanced  if  the  most  recent 
versions  are  also  physically  clustered. 

In  our  current  design,  we  have  chosen  to  use  the  scatter  table  tech¬ 
nique  for  implementation.  The  remainder  of  this  section  describes  how 
operations  on  the  multi-version  database  are  accomplished  in  this 
design.  However,  this  choice  does  not  compromise  the  generality  of  the 
main  results  we  will  obtain  for  the  implementation.  A  design  that  has 
similar  attributes  in  conflict  handling  but  uses  the  physical  clustering 
technique can be analogously obtained.

7.3.3  DATA  STRUCTURES 

The  data  structure  of  the  scatter  table  is  shown  in  Figure  26.  The 
table  is  composed  of  two  parts.  The  first  part,  called  the  most  recent 
version  region,  stores  the  entries  for  the  most  recent  versions.  The 
second  part,  called  the  old  version  region,  stores  the  entries  for  the 
older  versions.  The  number  of  entries  in  the  first  part  is  determined 
by the number of logical pages allocated in the system, while in the
second it is a function of what is considered adequate for holding all
the older versions that might still be accessed at any time.

For  the  purpose  of  garbage  collection,  the  old  version  region  is 
arranged as a ring buffer, with two pointers FREE_SPACE_BEGIN and
FREE_SPACE_END pointing to the beginning of the free entries and the end
of the free entries. When FREE_SPACE_BEGIN catches up with
FREE_SPACE_END, the scatter table for the older versions is full and the
system must wait for garbage collection to advance the FREE_SPACE_END
pointer before it can allow further most recent versions to be superseded.
We will design the garbage collection process to be a parallel
process that is constantly checking to get rid of stale versions.
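
The ring buffer behaviour can be sketched as below. This is a single-threaded illustration under several assumptions: entries are reclaimed strictly in allocation order, an explicit free counter is kept to distinguish a full ring from an empty one, and a full ring raises an error where the real system would wait. The class and method names are hypothetical.

```python
class OldVersionRing:
    def __init__(self, size):
        self.slots = [None] * size
        self.free_space_begin = 0    # next free slot to hand out
        self.free_space_end = 0      # oldest occupied slot, one past the last free slot
        self.free_count = size       # tells a full ring apart from an empty one

    def allocate(self, entry):
        if self.free_count == 0:
            # FREE_SPACE_BEGIN has caught up with FREE_SPACE_END.
            raise RuntimeError("old version region full; wait for garbage collection")
        index = self.free_space_begin
        self.slots[index] = entry
        self.free_space_begin = (index + 1) % len(self.slots)
        self.free_count -= 1
        return index

    def collect_oldest(self):
        # Garbage collection advances FREE_SPACE_END past one stale entry.
        if self.free_count == len(self.slots):
            return None
        index = self.free_space_end
        entry, self.slots[index] = self.slots[index], None
        self.free_space_end = (index + 1) % len(self.slots)
        self.free_count += 1
        return entry
```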

As shown in Figure 26, each scatter table entry contains (1) the version
timestamp (VTS), i.e., the timestamp used by the transaction to
create this version, (2) the read timestamp (RTS), i.e., the largest
timestamp of the transactions that read this version, (3) the pointer to
the scatter table entry of the immediately previous version (PPT), and
(4) the virtual address of the physical page that stores this version
(VA). For an entry in the old version region of the scatter table, the
id of the logical page whose old version the entry corresponds to is also
stored with the entry.

7.3.4  OPERATIONS  AND  SYNCHRONIZATION 


We  will  now  analyze  the  operations  to  be  performed  on  the  scatter 
table by the three atomic work units: ATOMIC_COMMIT, TIMESTAMPED_READ
and  NO_TRACE_READ.  The  purpose  is  to  identify  the  most  efficient  way  of 
controlling  concurrent  accesses  to  the  scatter  table  while  maintaining 
the  correctness  of  each  operation. 


7.3.4.1  ATOMIC_COMMIT


The  task  of  ATOMIC_COMMIT  for  a  transaction  with  a  timestamp  TS  is  as 
follows. (We assume that, for practicality, only the most recent versions
can be superseded by a new version.) For every new version to be
created, the scatter table entry for the most recent version is located
and the RTS field of that entry is compared with TS and the entry is
validated if RTS ≤ TS and VTS ≤ TS. If any entry cannot be validated,
the  transaction  is  aborted.  If  all  entries  are  validated,  then  the 
transaction  commits.  When  the  transaction  commits,  for  every  new  ver¬ 
sion  it  creates,  a  new  entry  in  the  scatter  table  must  be  created  and 
placed  in  the  proper  location  in  the  version  chain.  This  can  be  accom¬ 
plished  by,  for  each  new  version  to  be  created,  obtaining  a  free  entry 
space  in  the  Old  Version  Region  and  copying  the  entry  for  the  immediate¬ 
ly  previous  version  to  the  free  entry  space.  The  entry  for  the  new 
version  can  then  be  placed  in  the  entry  space  originally  occupied  by  the 
entry for the immediately previous version. This process of inserting
entries for new versions in the scatter table is shown in Figure 27.
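
The validation and insertion procedure above can be sketched as follows, ignoring locking and concurrency. The `Entry` and `ScatterTable` classes are hypothetical. The old version region is a plain list rather than a ring buffer, the validation test is taken to be RTS <= TS and VTS <= TS, and the initial RTS of a new version is set to its own timestamp. All of these are assumptions of the sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Entry:
    vts: int                        # version timestamp
    rts: int                        # largest timestamp of any reader
    ppt: Optional[int]              # index of the previous version in `old`
    va: int                         # virtual address of the physical page
    page_id: Optional[int] = None   # stored only in the old version region


class ScatterTable:
    def __init__(self, initial_vas):
        # Most recent version region: one entry per logical page id.
        self.recent = [Entry(vts=0, rts=0, ppt=None, va=va) for va in initial_vas]
        self.old = []               # old version region (simplified)

    def atomic_commit(self, ts, new_pages):
        """`new_pages` maps a logical page id to the VA of its new version."""
        # Stage 1: validate every page the transaction wrote.
        for page_id in new_pages:
            current = self.recent[page_id]
            if not (current.rts <= ts and current.vts <= ts):
                return False        # abort; the table is left untouched
        # Stage 2: insert each new version at the head of its chain.
        for page_id, va in new_pages.items():
            current = self.recent[page_id]
            # Copy the previous entry into the old region first ...
            self.old.append(replace(current, page_id=page_id))
            # ... then overwrite its original slot with the new entry.
            self.recent[page_id] = Entry(
                vts=ts, rts=ts, ppt=len(self.old) - 1, va=va
            )
        return True
```

Copying the previous entry before overwriting its slot means the chain reachable from the most recent slot is complete at every step of the insertion.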


The conflict among ATOMIC_COMMIT processes arises from version chain
insertions. If two ATOMIC_COMMIT processes are run in parallel and both
of them need to insert a new version of the same logical page, then use
of the above insertion procedure will lead to inconsistency. The inconsistency
is caused by the invalidation of a validated entry by concurrent
processes. To prevent this situation, a lock bit, called
ENTRY_LOCK, is associated with each entry, and must be acquired
before an entry is examined for validation. The lock will be
held until the transaction is committed and the insertion procedure for
that logical page is finished, or until the transaction aborts.

7.3.4.2  TIMESTAMPED_READ

The  task  for  TIMESTAMPED_READ  is  simpler.  Given  a  read  request  to  a 
logical page and a timestamp ceiling TS, the version chain in the scatter
table for that logical page is traversed until the entry for the
version with a version timestamp ≤ TS is located. The RTS field of that
entry  is  then  updated  to  be  the  maximum  of  its  original  value  and  TS. 
The  address  (i.e.,  VA  of  the  entry)  is  then  returned. 

The  conflict  between  TIMESTAMPED_READ  and  ATOMIC_COMMIT  lies  with 
accesses  to  the  RTS  field  of  an  entry.  Updating  RTS  of  an  entry  already 
validated  by  a  concurrent  ATOMIC_COMMIT  process  could  result  in  the 
invalidation  of  the  entry  for  the  ATOMIC_COMMIT  process.  To  prevent 
such  interference,  TIMESTAMPED_READ  is  required  to  acquire  ENTRY_LOCK 
associated with an entry before it reads it, and will immediately
release  it  after  it  is  determined  that  this  is  not  the  version  needed 
or,  if  it  is,  after  RTS  is  updated. 
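
The locking discipline for TIMESTAMPED_READ can be sketched as below. Each entry carries its own lock, modelled with `threading.Lock`. The entry layout, the `prev` reference that plays the role of PPT, and the comparison `vts <= ts` are assumptions of this illustration.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    vts: int
    va: int
    rts: int = 0
    prev: Optional["Entry"] = None      # plays the role of PPT
    entry_lock: threading.Lock = field(default_factory=threading.Lock)


def timestamped_read(most_recent, ts):
    entry = most_recent
    while entry is not None:
        with entry.entry_lock:          # acquire ENTRY_LOCK before reading the entry
            if entry.vts <= ts:
                entry.rts = max(entry.rts, ts)
                return entry.va         # lock is released on leaving the block
            older = entry.prev
        entry = older                   # not the version needed; lock already released
    raise LookupError("no version at or below the given timestamp")
```

At most one entry lock is held at a time, and only for the duration of one comparison and one possible RTS update.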


The conflict among TIMESTAMPED_READ's also arises from conflicting
accesses  to  the  RTS  field,  and  therefore  can  be  handled  by  the  above 
ENTRY_LOCK  protocol. 


7.3.4.3 NO_TRACE_READ


The task of NO_TRACE_READ is very similar to that of
TIMESTAMPED_READ. The only difference is that the former does not
require the update of the RTS field of the entry for the version to be
read. Since the logic of NO_TRACE_READ is not concerned with RTS, the
only conflict is between NO_TRACE_READ and ATOMIC_COMMIT when the latter
inserts entries in the version chain that the former is following.
However, the following two facts enable NO_TRACE_READ processes to
proceed without ever having to be concerned with concurrent
ATOMIC_COMMIT processes that operate on the version chain of the same
logical page:

(1) If entry e1 is locked by a concurrent ATOMIC_COMMIT process, then
the version to be inserted by that process cannot be the target
of any concurrent NO_TRACE_READ process. This is because the
time ceiling that a NO_TRACE_READ process uses is always smaller
than the timestamp of any active transaction that might write
on that logical page. (Refer to the definition of Protocol H.)

(2) Reading a locked entry e1 will not put a NO_TRACE_READ process
in danger of reading a distorted version chain, which could lead
to inconsistency. This can be seen from Figure 27 on page 154,
which shows the possible states that a NO_TRACE_READ may
encounter when retrieving a version of a data element while a
concurrent process is inserting a new version. The figure shows
that NO_TRACE_READ would not be misled even if it reads the
intermediate state produced by the concurrent insertion process.

Therefore it is concluded that a NO_TRACE_READ process can proceed
without ever interfering with any other concurrent process, and
therefore never has to wait or obey any other synchronization protocol.
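
The contrast between the two read operations can be illustrated with a minimal Python sketch. It assumes, beyond what the text states, that a version chain is a singly linked list ordered from the newest to the oldest version and that ENTRY_LOCK can be modelled by a threading.Lock. The names Entry, wts, rts, va and next are hypothetical and chosen only for illustration.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    wts: int                          # version timestamp of this entry
    va: int                           # address returned to the reader
    rts: int = 0                      # largest read ceiling recorded so far
    next: Optional["Entry"] = None    # next older version (assumed ordering)
    entry_lock: threading.Lock = field(default_factory=threading.Lock)


def timestamped_read(head, ts):
    """Lock each entry while examining it; record ts in RTS on a match."""
    e = head
    while e is not None:
        with e.entry_lock:            # released as soon as the entry is done
            if e.wts <= ts:
                e.rts = max(e.rts, ts)
                return e.va
            older = e.next
        e = older
    return None


def no_trace_read(head, ts):
    """Same traversal, but no lock is taken and RTS is left untouched."""
    e = head
    while e is not None:
        if e.wts <= ts:
            return e.va
        e = e.next
    return None
```

In this sketch only timestamped_read ever waits on a lock; no_trace_read never blocks, which mirrors the conclusion drawn above.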

7.3.5  GARBAGE  COLLECTION 

The garbage collection process is responsible for deleting stale
versions and advancing the FREE_SPACE_END pointer of the scatter table.
Also, whenever an old version is deleted, the physical page that stores
that old version is returned to the free page list available for
allocation to transactions as work space. What constitutes a stale
version and how the garbage collection process operates on the scatter
table are the subjects of the current discussion.


To determine old versions that are safe for deletion, the garbage
collection process can make use of the time wall concept related to the
Read-Only Protocol. Recall that to enable a read-only transaction to
access any data segment without having to leave timestamps, be delayed
or be aborted, a time wall, which is a set of times where each time is
associated with a data segment, is computed, and the read-only
transaction is allowed to access data versions that are the most recent
versions immediately prior to the times in the time wall. For example,
to access data element d in a data segment $D_i$, given that the time
associated with $D_i$ in the current time wall is $time_i$, the version
of d which is the most recent version before $time_i$ is granted for
access by the transaction.

It can be shown that the time wall, computed periodically, establishes
a set of times that can be used to determine which older versions
will never be accessed by active or future transactions, and therefore
can be garbage collected. We will first re-state the following fact:

Fact: Given a time wall TW computed and released at $time_0$ by the
procedure described in Section 6.2, let TW = <$time_1$, $time_2$,
..., $time_n$>, where $time_i$ is the time component in TW associated
with data segment $D_i$. Then no update transaction rooted in
data segment $D_i$, for i = 1 to n, that started before $time_i$ is
still active at time $time_0$.


The above statement confirms that a version of a data element in data
segment $D_i$ that has a successor version prior to the time wall
component $time_i$ in the currently applicable time wall will never be
requested by any active or future transaction. Therefore this version
is a stale version and can be garbage collected.

Armed with the above fact, the garbage collection process is a system
process which is invoked whenever a new time wall is released. For every
data element in a data segment $D_i$, the process traverses the version
chain of that data element until a version is found which has a version
timestamp prior to $time_i$ in the time wall. It then deletes all
versions of that data element prior to this version, if there are any.
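
The stale-version rule can be sketched in a few lines of Python. The sketch assumes that the version chain of one data element is available as a list of version timestamps ordered from newest to oldest; the function name split_stale and that representation are illustrative assumptions, not part of the described scatter table.

```python
def split_stale(version_timestamps, wall_time):
    """Return (kept, stale) for one data element.

    version_timestamps is assumed to be ordered newest first. The first
    version older than wall_time is still needed by readers; everything
    older than it has a successor before the wall and is therefore stale.
    """
    for idx, wts in enumerate(version_timestamps):
        if wts < wall_time:
            return version_timestamps[:idx + 1], version_timestamps[idx + 1:]
    # No version precedes the wall yet, so nothing can be collected.
    return list(version_timestamps), []
```

For instance, with versions stamped 50, 40, 30 and 20 and a time wall component of 35, only the version stamped 20 is reported as stale, because the version stamped 30 is the one a reader at the wall would be given.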

To make garbage collection more efficient, the logical page id field
of the scatter table entries in the Old Version Region of the scatter
table can be used to select the logical pages whose version chains are
to be examined. The garbage collection process can look only at scatter
table entries in the Old Version Region and, for each entry, follow the
version chain to see if there exists a successor version which is prior
to the specified time value in the time wall. Given that at any given
time most data elements would probably have only one version in the
database, selectively choosing the version chains to be examined would
greatly reduce the work of the garbage collection process: it needs to
look only at data elements with more than one version, rather than at
all data elements.


The Garbage Collection process would start at the entry pointed to by
FREE_ENTRY_END + 1 in the Old Version Region of the scatter table and
examine that entry according to the procedure described in the previous
paragraph. Once it is determined that the entry can be garbage
collected, the entry is marked free (e.g., by setting a bit in the
logical page id field) and FREE_ENTRY_END is incremented. This proceeds
until either an entry is found that is not eligible for garbage
collection, or the entry at FREE_ENTRY_END + 1 coincides with
FREE_ENTRY_BEGIN or is already marked free. The Garbage Collection
process can then be dormant until the next time wall is released.

Assuming that the operations of incrementing FREE_ENTRY_END and
FREE_ENTRY_BEGIN are executed atomically, the garbage collection process
will never interfere adversely with any other concurrent processes.
While it is in no conflict at all with TIMESTAMPED_READ and
NO_TRACE_READ, it can also be shown that it does not have to interfere
with concurrent insertion operations by ATOMIC_COMMIT. We give the
following facts to substantiate this statement:

(1) ATOMIC_COMMIT will never cause a version chain that the Garbage
Collection process must follow to be distorted. This has been
shown in discussing NO_TRACE_READ operations.

(2) Garbage Collection will never garbage collect an entry that an
ATOMIC_COMMIT process has just obtained from the free entry list
(and which is therefore in the domain of operation of the garbage
collection process) but has not yet written the new data into.
This is because such entries would have been marked free.


7.3.6 SUMMARY OF THE MULTI-VERSION DATABASE IMPLEMENTATION


To summarize our implementation of the Multi-version Database in the
concurrency control facility, we emphasize that while ATOMIC_COMMIT and
TIMESTAMPED_READ must be synchronized by the ENTRY_LOCK, NO_TRACE_READ
and GARBAGE_COLLECTION do not have to use any lock or otherwise obey
synchronization protocols, and therefore can always proceed without
being blocked or incurring overhead for other processes.


7.4  IMPLEMENTING  FUNCTIONS  FOR  MANAGING  AND  COMPUTING  TIMESTAMPS 


7.4.1  INTRODUCTION  AND  OVERVIEW 


In this section we describe how timestamps are assigned or computed.
Recall that in order to make use of Protocol H or Protocol L for
accessing data elements outside of a transaction's own root data
segment, it is necessary to compute an access timestamp which is
different from the transaction's own initiation timestamp. Computation
of these timestamps results in the need to maintain information on
transaction history as well as pseudo-transaction constraints.
Maintenance of this information is also the subject of the current
section.

As shown in Figure 23 on page 141, four tasks (atomic work units) are
related to timestamp management: INITIATION, COMMIT_TIMESTAMPING,
COMPUTE_RTC and REQUEST_LTS. The first three share accesses to the
Transaction Table, while the latter two share accesses to the
Pseudo-transaction Table. Both tables are maintained strictly for the
purpose of evaluating the A function to provide time ceilings used in
the H protocol. We will therefore first describe in the following
sub-section the algorithm of COMPUTE_RTC that evaluates the A function.
The need for a fast evaluation algorithm leads to the requirement that
the above information be presented to the COMPUTE_RTC module in a
structure that is most suitable for the algorithm, and that the
structure can be accessed by COMPUTE_RTC without interference from
concurrent updates to it. A design that accomplishes this goal, and
involves new tables compiled from the Transaction Table and the
Pseudo-transaction Table for access by COMPUTE_RTC, is described in the
final sub-section.

7.4.2  EVALUATING  THE  'A'  FUNCTION 

When a transaction rooted in data segment $D_i$ needs to access data
elements in a higher data segment $D_j$, it uses the H protocol and
accesses versions of these data elements in $D_j$ that are immediately
previous to a read time ceiling (RTC) computed by evaluating the function
$A_i^j(TS)$, where TS is the initiation timestamp of the transaction.
COMPUTE_RTC is a module which, when invoked, is given the source data
segment (e.g., $D_i$), the target data segment (e.g., $D_j$) and the
transaction timestamp (e.g., TS), and returns the value of $A_i^j(TS)$.

There are two strategies concerning when to evaluate the A function.
One is the static strategy, which computes at the initiation stage of a
transaction all the read time ceilings the transaction will potentially
use, and stores these values in the transaction header contained in the
transaction's work space. The other is the dynamic strategy, which
computes the read time ceiling only when a read access that requires it
is issued by the transaction. Once computed, this time ceiling is then
stored in the transaction header for possible future use. Which
strategy is to be preferred depends on the variability of the sets of
higher data segments the transactions in the same class will access.
However, regardless of the timing strategy, the basic function of the
module COMPUTE_RTC remains the same.

The function $A_i^j(TS)$ is evaluated by applying a series of I_OLD
functions along the critical path from the source data segment $D_i$
to the target data segment $D_j$. COMPUTE_RTC(i, j, TS) can be
implemented recursively. At each step, the parent data segment, say
$D_h$, of the current source data segment is found. An I_OLD function is
invoked which finds the initiation time of the oldest active transaction
at time TS in that parent data segment $D_h$. This initiation time is
then passed as the new TS to the next invocation of COMPUTE_RTC, along
with $D_h$ as the new source data segment. The procedure COMPUTE_RTC
therefore repeatedly calls itself until the source data segment and the
target data segment in the call coincide. The parent of a data segment
along the critical path to the target data segment is obtained from a
table PARENT, which is a static table derived from the data segment
hierarchy.
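
The recursion just described can be sketched as follows. This is only an illustration under stated assumptions: the PARENT table is modelled as a dictionary keyed by (current segment, target segment), and I_OLD is passed in as a callable i_old(segment, ts); both representations are hypothetical.

```python
def compute_rtc(source, target, ts, parent, i_old):
    """Evaluate the A function by walking the critical path upward.

    parent[(segment, target)] gives the next segment on the critical
    path toward target (an assumed encoding of the PARENT table);
    i_old(segment, ts) returns the initiation time of the oldest
    transaction active in segment at time ts.
    """
    if source == target:
        return ts
    step = parent[(source, target)]
    return compute_rtc(step, target, i_old(step, ts), parent, i_old)
```

Because each call moves one level along the critical path, the depth of the recursion is bounded by the length of that path, so an iterative loop would serve equally well.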

An efficient implementation of COMPUTE_RTC depends on an efficient
algorithm for finding the initiation time of the oldest active
transaction at a given time. Due to difficulties involved in concurrent
accesses to the Transaction Table and the Pseudo-transaction Table, two
new tables, called the Summary Tables, are derived from them for use by
COMPUTE_RTC. The data structures and operations on these tables, and
how fast evaluation of I_OLD can be achieved, are presented in the
following sub-section.

7.4.3  DATA  STRUCTURES  AND  OPERATIONS 


In this sub-section we describe the data structures of the four
tables involved in timestamp management and then discuss how concurrent
accesses, updates and garbage collection are achieved.

The four tables are the Transaction Table (TT), the Transaction
Summary Table (TT_SUM) which is derived from TT, the Pseudo-transaction
Table (PT) and the Pseudo-transaction Summary Table (PT_SUM) which is
derived from PT. Each data segment has a set of these tables associated
with it. They are all implemented as ring buffers with associated
FREE_SPACE_BEGIN and FREE_SPACE_END pointers. Each entry in the tables
is composed of two fields, Initiation time (IT) and Commit time (CT),
corresponding to a particular transaction or pseudo transaction.

7.4.3.1 THE TRANSACTION TABLE (TT)

The Transaction Table is merely a running log of initiation times of
recent transactions. For each incoming transaction, INITIATION provides
an initiation timestamp and creates a new entry at the bottom of TT with
the entry's commit time field set to a large value. Therefore TT is
always ordered by the initiation time field. When a transaction is
finished with its commit process, COMMIT_TIMESTAMPING provides a commit
timestamp, locates the transaction's entry in TT and updates the commit
time field. As shown in Figure 23, if the committing transaction is
the oldest active transaction at the time of its commitment, then
COMMIT_TIMESTAMPING removes the entries at the top of TT (by advancing
the FREE_SPACE_END pointer) until the new top entry is the next oldest
uncommitted transaction in the table. It also then creates appropriate
entries in the Transaction Summary Table. Since the entry at the top of
TT is always the current oldest active transaction, whether a committing
transaction should be inserted into the Summary Table can be easily
decided by the COMMIT_TIMESTAMPING process.

7.4.3.2 THE TRANSACTION SUMMARY TABLE (TT_SUM)

The Transaction Summary Table contains the initiation times and the
commit times of transactions that are 'time-dominant'. A time-dominant
transaction is one which at some point in time has been the oldest
active transaction in a data segment. Entries are inserted in TT_SUM
when a time-dominant transaction commits. Since time-dominant
transactions will commit in the order of their initiation times, the
insertion procedure ensures that TT_SUM is ordered by both the
initiation time and the commit time fields.

As discussed under The Transaction Table, insertion of entries into
TT_SUM is performed by COMMIT_TIMESTAMPING. Every time
COMMIT_TIMESTAMPING commits a transaction that is at the top of the
Transaction Table, that transaction's commit timestamp is inserted and
a new entry with the initiation time field set to the initiation time
of the next oldest active transaction is created.
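
The interplay between TT and TT_SUM can be sketched in Python as below. The sketch replaces the ring buffers with a deque and a list, uses infinity as the 'large value' for an unset commit time, and assumes that a transaction initiated into an empty TT immediately opens a TT_SUM entry; the text does not specify that last case, and the class and method names are hypothetical.

```python
from collections import deque

INF = float("inf")   # stands in for the 'large value' of an unset commit time


class SegmentTables:
    def __init__(self):
        self.tt = deque()    # [IT, CT] entries, ordered by IT, top at the left
        self.tt_sum = []     # [IT, CT] entries of time-dominant transactions

    def initiation(self, it):
        if not self.tt:
            # Assumption: with no active transaction, the newcomer is at once
            # the oldest active one and gets an open summary entry.
            self.tt_sum.append([it, INF])
        self.tt.append([it, INF])

    def commit_timestamping(self, it, ct):
        entry = next(e for e in self.tt if e[0] == it)
        entry[1] = ct
        if self.tt[0] is entry:                  # committer was the oldest active
            self.tt_sum[-1][1] = ct              # close its summary entry
            while self.tt and self.tt[0][1] != INF:
                self.tt.popleft()                # strip committed entries at the top
            if self.tt:                          # open an entry for the next oldest
                self.tt_sum.append([self.tt[0][0], INF])
```

A transaction that commits while an older one is still active leaves TT_SUM untouched, which is why the summary stays ordered by both fields.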

7.4.3.3 PSEUDO-TRANSACTION TABLE (PT)

Whenever the module REQUEST_LTS is invoked with a source data segment
$D_i$, a target data segment $D_j$ and a timestamp TS, REQUEST_LTS will
guess at the value of $B_i^j(TS)$ and return it. However, enforcement of
the guessed value requires pseudo transactions to be created in the
transaction classes rooted between $D_i$ and $D_j$, whose sole function
is to influence timestamp management. Therefore REQUEST_LTS must insert
the initiation time and the commit time of a pseudo transaction
associated with a data segment in the Pseudo-transaction Table of that
data segment for use by other timestamp management modules.

PT  is  to  be  ordered  by  initiation  times.  Since  pseudo  transactions 
may  not  arrive  in  that  order,  REQUEST_LTS  may  have  to  shift  entries  when 
inserting  a  new  entry. 

Entries in PT are removed by the INITIATION process. As shown in
Figure 29, when a real transaction is initiated, the INITIATION process
also looks at the top of the Pseudo-transaction Table and removes all
entries in the latter that have initiation times smaller than that of
the newly initiated transaction. (This can be done because pseudo
transactions are always assigned an initiation timestamp later than the
time when they are created.) These entries are then inserted into the
Pseudo-transaction Summary Table to be described below.

7.4.3.4 PSEUDO-TRANSACTION SUMMARY TABLE (PT_SUM)

The Pseudo-transaction Summary Table contains the initiation and
commit times of pseudo transactions that have become current enough for
use by COMPUTE_RTC. An entry in the Pseudo-transaction Table becomes
current when a real transaction with an initiation time greater than
that of the former has been initiated. Therefore inserting entries into
PT_SUM is performed by the INITIATION process.

Like the Transaction Summary Table, the semantics of the insertion
process enables PT_SUM to be ordered by both the initiation times and
the commit times. These two summary tables are therefore in a desired
form to be presented to COMPUTE_RTC.
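
The handling of PT and PT_SUM can be sketched as follows, assuming each table is a Python list of (IT, CT) tuples and using the standard bisect module in place of the entry shifting described for the ring buffer. The function names are hypothetical.

```python
import bisect


def request_lts_insert(pt, it, ct):
    """Insert a pseudo transaction so that PT stays ordered by initiation time."""
    bisect.insort(pt, (it, ct))


def initiation_sweep(pt, pt_sum, new_it):
    """Move every PT entry initiated before new_it to the bottom of PT_SUM."""
    # (new_it,) sorts before any (new_it, ct), so cut marks the first entry
    # whose initiation time is not smaller than new_it.
    cut = bisect.bisect_left(pt, (new_it,))
    pt_sum.extend(pt[:cut])
    del pt[:cut]
```

The sweep only ever appends to PT_SUM, which keeps insertions at the bottom of the summary table as the later non-interference argument requires.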


7.4.3.5 USING THE TIMESTAMP MANAGEMENT TABLES


The conflict matrix among the four operations INITIATION,
COMMIT_TIMESTAMPING, COMPUTE_RTC and REQUEST_LTS under the strategy of
deriving summary tables from the Transaction and Pseudo-transaction
tables is shown in Figure 30. Comparing this figure with Figure 23 on
page 141, one can see that the nature of the conflict between
COMPUTE_RTC and other atomic work units has been changed from a direct
conflict on PT and TT to one on PT_SUM and TT_SUM.


Given the two summary tables, the initiation time of the oldest active
transaction rooted in a data segment $D_i$ at a given time TS can be
found by looking into the summary tables corresponding to $D_i$
according to the following algorithm:

(1) Identify the current bottom entry of each of the two tables.

(2) For each of the tables, examine each entry from the bottom of the
table up until an entry e with a commit timestamp earlier than
TS is found. Then obtain the initiation time value of the entry
that follows e.

(3) Compute the minimum of the following three values: TS and the
two initiation values obtained from the two tables. Return this
minimum.
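
The three steps above can be sketched in Python, with each summary table modelled as a list of (IT, CT) pairs whose last element is the bottom entry. Two boundary cases are not covered by the text and are filled in here as assumptions: if no entry committed before TS, the top entry of the table is used, and if the matching entry e has no follower, the table contributes no candidate.

```python
def candidate_initiation(table, ts):
    """Step (2): scan upward from the bottom for an entry committed before ts."""
    for idx in range(len(table) - 1, -1, -1):
        if table[idx][1] < ts:
            follower = idx + 1
            # Assumption: no follower means the table imposes no constraint.
            return table[follower][0] if follower < len(table) else None
    # Assumption: nothing committed before ts, so fall back to the top entry.
    return table[0][0] if table else None


def i_old(tt_sum, pt_sum, ts):
    """Step (3): minimum of ts and the candidates from the two tables."""
    candidates = [ts]
    for table in (tt_sum, pt_sum):
        it = candidate_initiation(table, ts)
        if it is not None:
            candidates.append(it)
    return min(candidates)
```

The scan only reads entries and never follows pointers that an inserter rewrites, which is the property the non-interference argument relies on.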

It can be shown that this algorithm does not interfere with concurrent
insertion and garbage collection processes on the two summary
tables. To prove this, we provide the following facts. (Note that the
properties of the summary tables on which these facts rest are
missing from the original PT and TT; non-interfering reading of the
tables could therefore not be achieved were COMPUTE_RTC to access PT
and TT directly.)

(1) Insertion into the summary tables always occurs at the bottom of
the table. Newly inserted entries, however, will not be of
interest to any concurrent COMPUTE_RTC process because the latter
must be searching for an entry earlier than the ones being
created.

(2) Garbage collection of the summary tables always proceeds from the
top of the tables. However, it will not interfere with concurrent
COMPUTE_RTC processes because garbage collection only strips off
stale entries (i.e., entries corresponding to transactions
committed before a certain time wall) that would no longer be the
target entries being searched for by a concurrent COMPUTE_RTC.


To conclude, by designing data structures that are best suited for
COMPUTE_RTC, we have enabled the process of evaluating the A function
to proceed without interference from any concurrent activity.


8.0 HIERARCHICAL DATABASE DECOMPOSITION METHODOLOGY


8.1  INTRODUCTION  AND  SUMMARY  OF  RESULTS 


As concluded in previous chapters, successful application of the HDD
approach to concurrency control depends on an intelligent choice of the
data segment hierarchy on which the protocols are based. How this
choice can be arrived at is the subject of the current chapter on
decomposition methodology. In essence, we would like to develop an
algorithm that, given the description of a database application which
is captured in a specified form, will efficiently compute a favorable
database segmentation and a corresponding data segment hierarchy for
use by the HDD concurrency control technique.

This chapter is composed of four sections. The next section develops
a formal model for describing the problem of hierarchical decomposition.
The model is an integer programming model and takes the description of a
database application in terms of 'data components' and 'transaction
types'. The objective of the integer program is to find an optimal
scheme for clustering these data components into a hierarchy of data
segments. The complexity of the model is analyzed in the subsequent
section, which concludes that the decomposition problem is NP-hard, so
that optimization is practical only for very small problems. To
remedy this situation, a heuristic procedure is developed in section
four which attempts to find a 'good' solution in a reasonable amount of
time.


8.2 A FORMAL MODEL OF THE HIERARCHICAL DATABASE DECOMPOSITION PROBLEM


In this section, the problem of finding a favorable database segment
hierarchy given the description of a target database application is
formulated as an optimization problem.

The  formulation  assumes  that  an  analysis  of  the  database  application 
has  been  performed  and  the  content  of  the  database  as  well  as  its 
accesses  by  transactions  is  understood  and  documented.  This  document 
must  describe  three  aspects  of  the  application: 

(1) Data components, which are mutually exclusive and collectively
exhaustive subsets of the data items in the database where data
items within the same subset exhibit generic similarities. For
example, an EMPLOYEE relation could be considered a data
component. Data components will constitute the smallest units for
consideration as data segments (i.e., data segments are clusters
of data components).

(2) Transaction types, which are also mutually exclusive and
collectively exhaustive subsets of the update transactions to be
run in the database. Transactions within the same transaction type
have similar database access requirements. Transaction types
will constitute the smallest units for consideration as
transaction classes.

(3) Access frequencies, which describe the frequencies of read and
write accesses from each of the transaction types to each of the
data components.

A database application where the use of the HDD concurrency control
technique potentially produces gains is one in which accesses from some
transaction types to some data components are read-only. These
read-only accesses are the source of the usage of the H protocol.
However, given that the database application exhibits the potential for
applying the HDD concurrency control technique, one must also discover
how data components ought to be merged and fit into a hierarchy so as to
maximize the use of the H protocol and minimize the use of the L
protocol.

This problem of discovering the optimal data segment hierarchy can be
formulated as an integer programming problem in which the frequency of
usage of the H protocol and that of the L protocol can be expressed
given an assignment of the data components to a data segment hierarchy.
Formally, given a database application which is composed of a set of
data components $DC_1, DC_2, \ldots, DC_n$ and a set of transaction types
$TP_1, TP_2, \ldots, TP_m$, and given that the read and write access
frequencies of transaction type $TP_i$ to data component $DC_j$ are
known as $r_{i,j}$ and $w_{i,j}$, we want to find an assignment of the
n data components to a data segment hierarchy DSH consisting of up to n
data segments $DS_1, DS_2, \ldots, DS_n$, where it is assumed that
$DS_{i_1}$ is higher than $DS_{i_2}$ in DSH if $i_1 < i_2$ (i.e., we
assume that the index of a data segment is an indication of the
position of that data segment in the hierarchy), such that the
access gain function, defined as a weighted sum of the frequencies of
the usage of the H protocol and the L protocol, is maximized. In other
words, let

$r_{i,j}$ = frequency of read accesses from $TP_i$ to $DC_j$, i = 1,...,m, j = 1,...,n,

$w_{i,j}$ = frequency of write accesses from $TP_i$ to $DC_j$, i = 1,...,m, j = 1,...,n.

The problem is to solve for the following set of decision variables X
and Y:

\[
X_{j,k} =
\begin{cases}
1 & \text{if } DC_j \text{ is assigned to } DS_k \\
0 & \text{otherwise}
\end{cases}
\]

for j = 1,...,n and k = 1,...,n, and

\[
Y_{i,k} =
\begin{cases}
1 & \text{if } TP_i \text{ is assigned to } DS_k \\
0 & \text{otherwise}
\end{cases}
\]

for i = 1,...,m and k = 1,...,n.

Subject to the following constraints:

(1) Each data component must be assigned to one and only one data
segment,

\[
\text{i.e.,} \quad \sum_{k=1}^{n} X_{j,k} = 1 \quad \text{for } j = 1, \ldots, n.
\]

(2) Each transaction type must be rooted in one and only one data
segment,

\[
\text{i.e.,} \quad \sum_{k=1}^{n} Y_{i,k} = 1 \quad \text{for } i = 1, \ldots, m.
\]

(3) A transaction type will not write into a data segment higher than
its own root segment,

\[
\text{i.e.,} \quad w_{i,j} \, X_{j,k} \sum_{h=k+1}^{n} Y_{i,h} = 0 \quad \text{for all } i, j, k.
\]


The  objective  is  to  maximize  the  number  of  read  accesses  to  higher 
data  segments  from  transactions  rooted  in  lower  data  segments  (i.e., 
usage  of  the  H  protocol)  and  to  minimize  the  number  of  read  and  write 
accesses  to  lower  data  segments  from  transactions  rooted  in  higher  data 
segments  (i.e.,  usage  of  the  L  protocol.)  With  the  above  definitions, 
the  frequency  of  use  of  the  H  and  the  L  protocols  can  be  expressed  as 
follows: 

(1) Frequency of the H protocol

\[
= \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{h=k+1}^{n} r_{i,j} \, X_{j,k} \, Y_{i,h}
\]

(2) Frequency of the L protocol

\[
= \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{h=1}^{k-1} (r_{i,j} + w_{i,j}) \, X_{j,k} \, Y_{i,h}
\]

We will formulate the objective function as the first term above minus
the second term weighted by the 'distance' in the segment hierarchy
between the root data segment and the data segment being accessed, and
by p, the cost ratio of the L protocol over the H protocol (i.e., the
gains of the use of one H protocol will be offset by the cost of the use
of p times the L protocols of 'distance' one). Formally, let the weight
of using the L protocol to access a lower data segment $DS_{j_2}$ by a
transaction rooted in a higher data segment $DS_{j_1}$ be
$d_{j_1,j_2}$. Then $d_{j_1,j_2} = p(j_1 - j_2)$ for $j_1 < j_2$. The
objective function can be formulated as follows:

\[
\max \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{n}
\Bigl( \sum_{h=k+1}^{n} r_{i,j} \, X_{j,k} \, Y_{i,h}
- \sum_{h=1}^{k-1} (r_{i,j} + w_{i,j}) \, X_{j,k} \, Y_{i,h} \, d_{h,k} \Bigr)
\]


Now we summarize the above formulation by presenting the entire problem
formulation for the HAO problem as follows:

\[
\max \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{n}
\Bigl( \sum_{h=k+1}^{n} r_{i,j} \, X_{j,k} \, Y_{i,h}
- \sum_{h=1}^{k-1} (r_{i,j} + w_{i,j}) \, X_{j,k} \, Y_{i,h} \, d_{h,k} \Bigr)
\]

Subject to

\[
\begin{aligned}
(1)\quad & \sum_{k=1}^{n} X_{j,k} = 1, \qquad X_{j,k} \in \{0, 1\} \\
(2)\quad & \sum_{k=1}^{n} Y_{i,k} = 1, \qquad Y_{i,k} \in \{0, 1\} \\
(3)\quad & w_{i,j} \, X_{j,k} \sum_{h=k+1}^{n} Y_{i,h} = 0
\end{aligned}
\]

for all i = 1,...,m, j = 1,...,n, and k = 1,...,n.

The above formulation will from now on be referred to as the Hierarchical
Assignment Optimization Problem, or the HAO problem.
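
To make the formulation concrete, the following Python sketch evaluates the objective for one candidate assignment. Instead of the 0-1 matrices X and Y it takes two lists, f (segment index of each data component) and g (root segment index of each transaction type), so constraints (1) and (2) hold by construction. The weight table d is supplied by the caller as a dictionary keyed by (root segment, accessed segment); the function name and all data values used with it are illustrative assumptions.

```python
def hao_objective(r, w, f, g, d):
    """Objective value of one assignment, or ValueError if (3) is violated.

    r[i][j], w[i][j]: read/write frequencies of type i on component j.
    f[j]: segment index of component j; g[i]: root segment index of type i.
    A smaller segment index means a higher position in the hierarchy.
    d[(h, k)]: weight of an L-protocol access from root h to lower segment k.
    """
    total = 0
    for i, root in enumerate(g):
        for j, seg in enumerate(f):
            if seg < root:                     # accessed segment is higher
                if w[i][j] != 0:
                    raise ValueError("write into a higher segment")
                total += r[i][j]               # H-protocol read
            elif seg > root:                   # accessed segment is lower
                total -= (r[i][j] + w[i][j]) * d[(root, seg)]
            # seg == root: access within the root segment, neither protocol
    return total
```

An exhaustive search would call such a function once for every pair (f, g), which is exactly the enumeration whose cost motivates the complexity analysis in the next section.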


8.3  COMPLEXITY  OF  THE  HIERARCHICAL  ASSIGNMENT  OPTIMIZATION  PROBLEM 

The integer programming optimization problem presented in the previous
section, once solved for a particular set of r and w parameters,
will produce the optimal segment hierarchy for use by the HDD
concurrency control technique. However, the complexity of the HAO
problem appears to prohibit an exhaustive search for an optimal solution
when the problem is of a non-trivial size. In this section, the
complexity of the HAO problem is analyzed. The theory of NP-completeness
is applied to show that the HAO problem is NP-equivalent, which amounts
to [...] class NP. This conclusion signals the need for an efficient
heuristic procedure for producing 'good' solutions for the HAO problem.

We  will  first  review  and  clarify  the  concept  of  NP-hardness.  A  prob¬ 
lem  p  is  said  to  be  NP-hard  (i.e.,  at  least  as  hard  as  an  NP-complete 
problem)  if  there  exists  an  NP-complete  problem  p'  such  that  p'  is 
Turing reducible to p <Garey79>. Turing reducibility from p' to p is
established  by  asserting  the  existence  of  a  (polynomial  time)  Turing 
reduction  algorithm.  A  Turing  reduction  algorithm  from  a  search  problem 
p'  to  another  search  problem  p  is  an  algorithm  A  that  solves  p'  by 
using  a  hypothetical  subroutine  S  for  solving  p  such  that  if  S  were  a 
polynomial  time  algorithm  for  p,  then  A  would  be  a  polynomial  time 
algorithm  for  p' .  Intuitively,  this  means  that  one  can  show  that  a 
problem  p'  is  Turing  reducible  to  p  by  demonstrating  that  p'  can  be 
solved  by  calling  a  polynomial  number  of  times  the  algorithm  that  solves 
p  as  a  subroutine. 

Based on the above discussion, to prove that HAO is NP-hard, one must find an NP-complete problem p' and show that p' is Turing reducible to HAO. To facilitate the proof procedure, we will re-phrase the integer programming formulation of the HAO problem. The new formulation will express the sets of 0-1 decision variables X and Y as functions f and g where f assigns data components to data segments and g maps transaction types to their root data segments. In addition, the set of weight values d is augmented in order to simplify the expression of the objective function. This new formulation is given below:

The Hierarchical Assignment Optimization (HAO) Problem:

Instance: Given $r_{i,j}$, $w_{i,j}$, i = 1,...,m, j = 1,...,n, and $d_{h,k}$, h,k = 1,...,n, where

$$
d_{h,k} = \begin{cases} 0 & \text{if } h = k \\ 1 & \text{if } h > k \\ p(h-k) & \text{if } h < k \end{cases}
$$

where p is a positive number.

Objective: Find functions f and g where

f: {1,2,...,n} -> {1,2,...,n}, and

g: {1,2,...,m} -> {1,2,...,n}, such that

(HAO-C1) for all i and j where $w_{i,j} \ne 0$, $g(i) \le f(j)$, and

(HAO-Obj) the weighted sum $\sum_i \sum_j (r_{i,j} + w_{i,j})\, d_{g(i),f(j)}$ is maximized.
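The reformulation in terms of f and g can be evaluated directly. The following is a minimal sketch, not part of the original formulation: the function names (`d`, `is_feasible`, `hao_objective`) and the toy access frequencies are assumptions made for illustration, and lists are 0-indexed while segment levels keep the 1-based numbering of the text.

```python
def d(h, k, p):
    """Augmented weight d[h,k] as defined in the HAO instance."""
    if h == k:
        return 0.0
    if h > k:
        return 1.0
    return p * (h - k)  # negative when h < k


def is_feasible(w, f, g):
    """Constraint HAO-C1: g(i) <= f(j) wherever w[i][j] != 0."""
    return all(
        g[i] <= f[j]
        for i in range(len(w))
        for j in range(len(f))
        if w[i][j] != 0
    )


def hao_objective(r, w, f, g, p):
    """Weighted sum HAO-Obj over all (transaction type, data component) pairs."""
    return sum(
        (r[i][j] + w[i][j]) * d(g[i], f[j], p)
        for i in range(len(r))
        for j in range(len(f))
    )


# Hypothetical instance: 2 transaction types, 2 data components.
r = [[10, 0], [4, 6]]   # read frequencies r[i][j]
w = [[0, 0], [0, 2]]    # write frequencies w[i][j]
p = 0.5

good = hao_objective(r, w, f=[1, 2], g=[2, 2], p=p)   # reads of component 0 use H
bad = hao_objective(r, w, f=[2, 1], g=[1, 1], p=p)    # reads of component 0 use L
```

In this toy instance, placing the heavily read component in the higher segment yields a positive objective, while the reversed hierarchy yields a negative one.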


Theorem. The Hierarchical Assignment Optimization problem is NP-hard.

Proof. The proof consists of two parts. The first part proves that the decision problem counterpart of the HAO problem, called the Hierarchical Assignment problem, or the HA problem, is NP-complete. The second part, using relatively standard arguments, proves that the HA problem is Turing reducible to the HAO problem; therefore, by the definition of NP-hardness, HAO is NP-hard.

To prove the first part, we will first formulate the decision problem counterpart of the HAO problem, namely, the Hierarchical Assignment Problem, as follows:

The Hierarchical Assignment (HA) Problem:

Instance: Given $r_{i,j}$, $w_{i,j}$, i = 1,...,m, j = 1,...,n, a positive integer Q, and $d_{h,k}$, h,k = 1,...,n, where

$$
d_{h,k} = \begin{cases} 0 & \text{if } h = k \\ 1 & \text{if } h > k \\ p(h-k) & \text{if } h < k \end{cases}
$$

where p is a positive number.

Question: Do there exist two functions f and g where

f: {1,2,...,n} -> {1,2,...,n}, and

g: {1,2,...,m} -> {1,2,...,n}, such that

(HA-C1) for all i and j where $w_{i,j} \ne 0$, $g(i) \le f(j)$, and

(HA-C2) the weighted sum $\sum_i \sum_j (r_{i,j} + w_{i,j})\, d_{g(i),f(j)}$ is at least Q?

Lemma. HA is NP-complete.

Proof. (1) NP class membership: It is obvious that given f and g, the decision problem can be verified in polynomial time by merely (a) computing the sum and checking it against Q, and (b) verifying that $g(i) \le f(j)$ for all i and j where $w_{i,j} \ne 0$.


(2) NP hardness: HA is NP-hard if one can find a problem p known to be NP-complete such that p is (polynomial) transformable to HA. We will present a proof using the technique of restriction, which identifies a restricted instance of HA and equates it to a known NP-complete problem, namely, the Maximum Cut (MCUT) problem.

Restricting the HA problem:

Step 1: Let the parameters $w_{i,j}$ be restricted such that there exists a function ROOT: {1,2,...,m} -> {1,2,...,n} such that $w_{i,j} = 0$ for all j ≠ ROOT(i). (I.e., every transaction type $T_i$ writes in only one data component, namely, ROOT(i).)

The purpose of this restriction is to enable the dropping of the function g and the constraint HA-C1 from the HA problem. We achieve this by restricting the problem so as to make function g completely derivable from the function f given another known function ROOT. The reason that we can do this is as follows. Since $d_{h,k}$ is positive only when h > k, there is an incentive built into the problem, due to the nature of the objective function, to maximize g(i) so as to increase the chance that the multiplier $d_{g(i),f(j)}$ actually used in computing the objective function is positive. However, maximizing g(i) is subject to constraint (HA-C1), i.e., g(i) cannot be greater than f(j) for any j where $w_{i,j} \ne 0$. Combining this observation with the current restriction that $w_{i,j} = 0$ for all j ≠ ROOT(i), it is clear that the optimal choice of g(i) under the current restriction is g(i) = f(ROOT(i)). Therefore, in this restricted version of the HA problem, the function g is entirely derivable from function f, and the constraint HA-C1 can be dropped.


Step 2: We will subject the problem to the following additional two restrictions:
m = 2
n > m.

With these restrictions, and the fact that max(f(j)) ≤ min(m,n), the function f can now be defined as follows:
f: {1,2,...,n} -> {1,2}.

The reason for f(j) to be bounded from above by min(m,n) is that the number of data segments contained in an optimal data segment hierarchy cannot be more than the number of transaction types, because if this were not the case, then there would exist some data segment not rooted by any transaction type that can be merged with its parent data segment in the segment hierarchy without incurring a penalty to the objective function, thereby reducing the number of data segments in the optimal data segment hierarchy.


The  purpose  of  this  step  2  restriction  is  to  transform  the  assignment 
problem  into  a  2-partition  problem. 


Step 3: Let $q_{i,j} = r_{i,j} + w_{i,j}$. The last set of restrictions to be applied is as follows:

p < 1, and
$q_{i,j} = q_{j,i}$ for all i and j.

These restrictions are applied to transform the problem from a directed partition problem into a simple (i.e., undirected) one.

The  original  problem  with  these  three  steps  of  restrictions  can  now 
be  reformulated  as  follows: 

The Restricted HA problem:

Instance: Given $r_{i,j}$, $w_{i,j}$, where i = 1,2 and j = 1,2,...,n, and n > 2, a function ROOT: {1,2} -> {1,2,...,n}, a set of d values where $d_{1,1} = d_{2,2} = 0$, $d_{2,1} = 1$ and $d_{1,2} = -p$ where 0 < p < 1, and a positive integer Q, and that these parameters satisfy the constraints (1) $r_{i,j} + w_{i,j} = r_{j,i} + w_{j,i}$ for all i,j and (2) $w_{i,j} = 0$ for all pairs of i and j where j ≠ ROOT(i).

Question: Does there exist a function f: {1,2,...,n} -> {1,2} such that $\sum_i \sum_j (r_{i,j} + w_{i,j})\, d_{f(ROOT(i)),f(j)}$ is at least Q?

We  will  now  show  that  the  Restricted  HA  problem  is  identical  to  the 
Maximum  Cut  problem  defined  below: 

The Max Cut (MCUT) problem:

Instance: Given a graph G = (V,E), where V = {1,2,...,v}, weights on edges w(k,j) ∈ positive integers for all edges e(k,j) ∈ E, and a positive integer Q'.

Question: Is there a partition of V into disjoint sets V1 and V2 such that the sum of the weights of the edges from E that have one end point in V1 and one end point in V2 is at least Q'?

Let p' be the smallest integer such that p'/(1-p) is an integer. The process of equating MCUT to Restricted-HA is as follows:

(1) Let V = {1,2,...,n}.

(2) Let w(k,j) = p' * ($r_{i,j} + w_{i,j}$), where k = ROOT(i).

(3) Let Q' = Q * p'/(1-p).

Since  MCUT  is  NP-complete  <Garey79>  and  Restricted-HA  is  identical  to 
MCUT,  we  conclude  that  MCUT  is  transformable  to  HA,  and  therefore  HA  is 
NP-complete.  Q.E.D. 
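The factor p'/(1-p) in step (3) can be checked with a short derivation. The sketch below assumes, as the restrictions intend, that the weights $q_{k,j} = r_{i,j} + w_{i,j}$ with k = ROOT(i) are symmetric over vertex pairs, and it takes $d_{2,1} = 1$ and $d_{1,2} = p(1-2) = -p$ from the general definition of d. The symbol $\mathrm{cut}_q(f)$, the total weight of the pairs separated by f, is introduced here only for illustration.

```latex
\begin{align*}
\sum_{k \ne j} q_{k,j}\, d_{f(k),f(j)}
  &= \sum_{\{k,j\}:\; f(k) \ne f(j)} q_{k,j}\,\bigl(d_{2,1} + d_{1,2}\bigr)
   = (1-p)\,\mathrm{cut}_q(f), \\[4pt]
(1-p)\,\mathrm{cut}_q(f) \ge Q
  &\iff \mathrm{cut}_q(f) \ge \frac{Q}{1-p}
   \iff p'\,\mathrm{cut}_q(f) \ge Q\,\frac{p'}{1-p} = Q'.
\end{align*}
```

Pairs assigned to the same segment contribute nothing because $d_{1,1} = d_{2,2} = 0$, and each separated pair is counted once in each direction, which is where the factor (1-p) comes from. Multiplying by p' only serves to make the edge weights and the threshold integers.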

Now we must show that HA is Turing reducible to HAO to complete the proof of our theorem. This, however, is obvious because if there existed a polynomial algorithm A that solves HAO, then HA could be easily answered by calling A once and comparing the optimal value produced by A with the decision threshold Q in HA. Q.E.D.
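The reduction amounts to a single call to the optimizer followed by a comparison. The sketch below illustrates this under stated assumptions: `solve_hao_bruteforce` is a hypothetical stand-in for the subroutine A (it enumerates all f and g, so it is exponential and only usable on tiny instances), and the instance data are invented for illustration.

```python
from itertools import product


def d(h, k, p):
    """Augmented weight d[h,k] from the HAO instance definition."""
    if h == k:
        return 0.0
    return 1.0 if h > k else p * (h - k)


def solve_hao_bruteforce(r, w, p):
    """Hypothetical subroutine A: optimal HAO value by exhaustive search."""
    m, n = len(r), len(r[0])
    levels = range(1, n + 1)
    best = None
    for f in product(levels, repeat=n):
        for g in product(levels, repeat=m):
            # Constraint C1: a type is rooted no lower than any segment it writes.
            if any(w[i][j] != 0 and g[i] > f[j] for i in range(m) for j in range(n)):
                continue
            value = sum(
                (r[i][j] + w[i][j]) * d(g[i], f[j], p)
                for i in range(m)
                for j in range(n)
            )
            if best is None or value > best:
                best = value
    return best


def decide_ha(r, w, p, Q, solve_hao=solve_hao_bruteforce):
    """Answer the HA decision question with one call to the HAO solver."""
    return solve_hao(r, w, p) >= Q


r = [[10, 0], [4, 6]]
w = [[0, 0], [0, 2]]
optimum = solve_hao_bruteforce(r, w, 0.5)
```

If the solver passed as `solve_hao` ran in polynomial time, `decide_ha` would too, which is the content of the Turing reduction.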


8.4  A  HEURISTIC  ALGORITHM  FOR  THE  HIERARCHICAL  ASSIGNMENT  OPTIMIZATION


In this section we will present a heuristic procedure for producing a good but not necessarily optimal solution for the hierarchical assignment optimization (HAO) problem introduced in the previous sections. The heuristic procedure is based on an analysis of the special structure of the HAO problem. This analysis sheds light on two aspects of the problem: first, how one may find a potentially good initial solution for the problem; second, how one may improve on this initial solution.

8.4.1  THE  TABULAR  REPRESENTATION  OF  A  SOLUTION  TO  THE  HAO  PROBLEM

We  will  show  that  any  solution  to  the  HAO  problem  can  be  represented  in 
a  tabular  form  in  which  the  individual  elements  in  the  two  summation 
components  of  the  objective  function  are  separately  displayed.  Based  on 
this observation, the objective of the heuristic procedure is to identify a solution whose tabular form possesses the property that the region representing the positive component of the objective function is maximized while the region representing the negative component is minimized.
This  observation  is  used  to  guide  the  design  of  our  heuristic  procedure. 
We  will  therefore  first  briefly  describe  this  tabular  form  of  solution 
to  the  HAO  problem. 

Let the parameters of a database application be represented by an access frequency table in which there are m rows representing m transaction types and n columns representing n data components. Each cell of the table is composed of two values $r_{i,j}$ and $w_{i,j}$. An example of such a table is shown in Figure 31.

Given any feasible solution S = {f, g} to an HAO problem with an access frequency table T, where f and g are defined as in the previous section, we can partition the cells in T into three kinds:

(1) The E cells: A cell (i,j) in T is an E-protocol cell (or E cell) if there exists k such that g(i) = k and f(j) = k.

(2) The H cells: A cell (i,j) in T is an H-protocol cell (or H cell) if g(i) = k2 and f(j) = k1 and k1 < k2.

(3) The L cells: A cell (i,j) in T is an L-protocol cell (or L cell) if g(i) = k1 and f(j) = k2 and k1 < k2.

In essence, a cell is an E cell (H cell, L cell) if, under solution S, the accesses to the database described by that cell have to use the E protocol (the H protocol, the L protocol). By the definition of a feasible solution, all $w_{i,j}$ where (i,j) is an H cell must be zero. The objective function value corresponding to a feasible solution S can be given as the following:

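$$
(\text{Obj})\quad \sum_{(i,j)\ \text{is an H cell}} r_{i,j} \;+\; \sum_{(i,j)\ \text{is an L cell}} (r_{i,j} + w_{i,j})\, d_{g(i),f(j)}
$$

where $d_{g(i),f(j)} = p(g(i) - f(j))$ is negative for every L cell, so that the second sum is the penalty term.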
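The three kinds of cells and the two components of the objective value can be computed cell by cell. The following sketch is illustrative only: the names `classify_cell` and `region_totals` and the sample table are assumptions, and the weight of an L cell is taken as p(g(i) - f(j)) from the HAO instance definition.

```python
def classify_cell(g_i, f_j):
    """Return 'E', 'H' or 'L' for a cell whose row is rooted at g_i
    and whose column is assigned to segment f_j."""
    if g_i == f_j:
        return "E"
    return "H" if g_i > f_j else "L"


def region_totals(r, w, f, g, p):
    """Return (gain from H cells, weighted term from L cells)."""
    h_gain, l_term = 0.0, 0.0
    for i in range(len(r)):
        for j in range(len(f)):
            kind = classify_cell(g[i], f[j])
            if kind == "H":
                h_gain += r[i][j]              # w[i][j] is 0 in a feasible solution
            elif kind == "L":
                l_term += (r[i][j] + w[i][j]) * p * (g[i] - f[j])  # negative
    return h_gain, l_term


# Hypothetical access frequency table: 2 transaction types, 2 components.
r = [[10, 0], [4, 6]]
w = [[0, 0], [0, 2]]
totals_a = region_totals(r, w, f=[1, 2], g=[2, 2], p=0.5)
totals_b = region_totals(r, w, f=[2, 1], g=[1, 1], p=0.5)
```

The objective value of a solution is the sum of the two returned numbers; E cells contribute nothing.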

The tabular representation of a solution S for a problem with an access frequency table T is a permutation of T in which these three kinds of cells are neatly bundled together. Formally, the tabular representation of an access table T given a solution S, denoted as p(T,S), is a table produced by permuting rows and columns in T using the following rules:

(1) Permute the columns of table T such that if f(j1) = k1 and f(j2) = k2 and k1 < k2 then column j1 is before column j2 in the representation.

(2) Permute the rows of table T such that if g(i1) = k1 and g(i2) = k2 and k1 < k2 then row i1 is before row i2 in the representation.
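The two permutation rules amount to sorting columns by f and rows by g. A minimal sketch follows; the function name `tabular_representation` and the sample table are assumptions made for illustration, and ties (columns or rows in the same segment) are left in their original relative order.

```python
def tabular_representation(table, f, g):
    """Reorder an access frequency table so that columns are in ascending
    order of f and rows are in ascending order of g.

    table[i][j] is the (r, w) pair for transaction type i and component j.
    Returns the row order, the column order and the permuted table.
    """
    col_order = sorted(range(len(f)), key=lambda j: f[j])  # rule (1)
    row_order = sorted(range(len(g)), key=lambda i: g[i])  # rule (2)
    permuted = [[table[i][j] for j in col_order] for i in row_order]
    return row_order, col_order, permuted


# Hypothetical table with 2 transaction types and 2 data components.
table = [[(10, 0), (0, 0)],
         [(4, 0), (6, 2)]]
rows, cols, permuted = tabular_representation(table, f=[2, 1], g=[2, 1])
```

After the reordering, a cell lies below the diagonal band exactly when its row is rooted in a lower segment than its column, which is the H-cell condition.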

It  is  interesting  to  see  that  in  the  tabular  representation  the  E  cells 
partition  the  rest  of  the  cells  diagonally  into  two  regions,  where  the 
lower-left  region  corresponds  to  the  collection  of  H  cells,  and  the 
upper-right  region  corresponds  to  the  collection  of  L  cells.  An  example 
of  the  tabular  representation  of  a  solution  to  the  example  table  shown 
in  Figure  31  is  shown  in  Figure  32. 

The  property  that  the  three  kinds  of  cells  are  bundled  into  regions 
in  the  tabular  representation  enables  a  visualization  of  a  solution  S. 
Solving for the HAO problem with a table T can now be considered as a process of shuffling the columns and rows in T and creating a diagonal region such that (1) the constraint on H cells is enforced below it and (2) the values in the cells in the lower-left region are maximized w.r.t. those in the upper-right region. The solution captured in the
tabular  representation  then  can  be  read  from  the  permutated  sequence  of 
the columns and the span of the diagonal region. This pictorial expla-

[Figure 32: Example of the tabular representation of a solution. Assuming a solution to the above problem given by a data segment hierarchy DSH*, the tabular solution is shown with each region delineated: the upper-right region (the L-cell region), the lower-left region (the H-cell region), and the diagonal region (the E-cell region, shaded area).]


The heuristic procedure consists of two stages as shown in Figure 33. The first stage attempts to find a 'favorable' permutation of the data components. Once this permutation # is found, where #(i) denotes the index of the data component chosen to be the i-th data component in the permutation, the function f is simply one which assigns data components to data segments in a one-to-one manner where data component DC_{#(i)} is assigned to data segment DS_i. The function g, derivable from function f, assigns the highest data segment in f that a transaction type writes in to be its root data segment. No merging of multiple data components to form one data segment is considered at this stage. Therefore the solution produced by this stage would result in a data hierarchy consisting of n data segments, and a transaction type is assigned to be rooted in the highest data segment it has to write into.

The second stage is one in which a local interchange/merging process of a certain 'degree' K is performed on the first-stage solution S^(1) to improve on the latter and to arrive at a final solution S^(2). If the objective function value of the solution S^(2), denoted as v(S^(2)), is non-positive, then the heuristic procedure has failed to produce a feasible data segment hierarchy that will produce net gains.




8.4.3  THE  FIRST  STAGE  OF  THE  HEURISTIC  PROCEDURE 


The objective of the first stage is to find a reasonable permutation # of the data components. The 'reasonableness' of this initial solution hinges upon the likelihood of the existence of a 'good' solution S^(2) in which those data components that are close to each other in the permutation # are either merged or assigned to data segments that are close to each other. If such a good solution does exist, then it can be found in the second stage from the first-stage solution S^(1) with a proper choice of the degree K.

In our algorithm, # is chosen in a sequential manner by selecting, based on some rules, data component j1 to be #(1), then data component j2 to be #(2), and so on. Therefore this process is composed of a total of n steps, at the i-th step of which #(i) will be selected from the remaining pool of data components. At any step we must evaluate the remaining data components to find the one that might produce the biggest gain when assigned to #(i). To avoid combinatorial explosion, we wish to limit our consideration to just the immediate impact of the next selection on the value of the objective function, without regard to the impact of the ordering in # of all the yet-unselected ones. We will compute this immediate impact by pretending that the remaining data components would be merged once the next selection is made, and therefore the component selected should be the one that will make the biggest contribution to the objective function under this assumption. This
procedure  embodies  the  philosophy  of  a  'greedy'  move  often  encountered 
in  the  design  of  heuristic  procedures,  which  selects  the  next  node  to  be 
evaluated  in  the  branching  tree  by  computing  the  immediate  gains  only. 


An additional rationale for this algorithm is the recognition that the optimal solution to the HAO problem is one in which those data components that many transaction types have high-volume read-only accesses to are assigned to data segments that are relatively higher than some others, so as to enable these read-only accesses to use the H protocol and therefore increase the value of the objective function. The algorithm in stage 1 has the property that it tends to select data components that many transaction types generate high-volume read-only accesses to as the first ones in the sequence #. This increases the chance that the initial solution is close to some good solution.


This procedure is formally described as follows. Let $S^{(1,i)} = \{f^{(1,i)}, g^{(1,i)}\}$, $i \ge 1$, be the solution produced right after the i-th component of # is computed, i.e.,

$$ f^{(1,i)}(\#(j)) = j \quad \text{for } j = 1, \ldots, i, $$

$$ f^{(1,i)}(j) = i+1 \quad \text{for all } j \text{ not in } \{\#(1), \#(2), \ldots, \#(i)\}, $$

$$ g^{(1,i)}(k) = \min \{\, f^{(1,i)}(j) \mid w_{k,j} \ne 0 \,\}. $$

The heuristic rule for choosing #(i) is as follows: choose #(i), 1 ≤ #(i) ≤ n, #(i) not in {#(1), #(2), ..., #(i-1)}, such that $v(S^{(1,i)})$ is maximized.
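The greedy rule can be sketched as follows. This is an illustration under stated assumptions rather than the report's implementation: the names `value` and `first_stage` are hypothetical, components are 0-indexed while segment levels are 1-based, ties are broken in favor of the lowest component index, and a transaction type that writes nothing is assumed to be rooted in the lowest segment (the text does not define g for that case).

```python
def d(h, k, p):
    """Augmented weight d[h,k] from the HAO instance definition."""
    if h == k:
        return 0.0
    return 1.0 if h > k else p * (h - k)


def value(r, w, f, p):
    """Objective value v(S) with g derived from f: each type is rooted in
    the highest (smallest-numbered) segment it writes into."""
    lowest = max(f)
    total = 0.0
    for r_i, w_i in zip(r, w):
        # Assumption: read-only transaction types are rooted at the bottom.
        g_i = min((f[j] for j, x in enumerate(w_i) if x != 0), default=lowest)
        total += sum((r_i[j] + w_i[j]) * d(g_i, f[j], p) for j in range(len(f)))
    return total


def first_stage(r, w, p):
    """Greedy construction of the permutation and the one-to-one assignment f."""
    n = len(r[0])
    perm, remaining = [], set(range(n))
    for step in range(1, n + 1):
        best_j, best_v = None, None
        for j in sorted(remaining):
            # Pretend all still-unselected components are merged at level step+1.
            f = [step + 1] * n
            for level, c in enumerate(perm + [j], start=1):
                f[c] = level
            v = value(r, w, f, p)
            if best_v is None or v > best_v:
                best_j, best_v = j, v
        perm.append(best_j)
        remaining.remove(best_j)
    f = [0] * n
    for level, c in enumerate(perm, start=1):
        f[c] = level
    return perm, f


r = [[10, 0], [4, 6]]
w = [[0, 0], [0, 2]]
perm, f1 = first_stage(r, w, 0.5)
```

On this toy table the heavily read, never written component 0 is selected first and therefore ends up in the highest data segment.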


8.4.4  THE  SECOND  STAGE  OF  THE  HEURISTIC  PROCEDURE 


In  the  second  stage,  attempts  are  made  to  improve  the  solution 
produced  by  the  first  stage  through  performing  up  to  n  local 
interchange/merging  steps.  Each  of  these  steps  is  performed  on  top  of 
the  intermediate  solution  produced  in  the  previous  step  and  is  guaran¬ 
teed  to  generate  a  solution  at  least  as  good  as  the  previous  one.  An 
informal  description  of  the  algorithm  is  as  follows. 


At the first step, the solution S^(1) produced by the first stage is taken as the initial solution. The bottom K data segments in this solution, for a fixed and small number K, are examined and evaluations are made for all possible re-arrangements, including merging, of these K data segments, while keeping the rest of the data segments where they were in the initial solution. An arrangement of these K data segments is then chosen which is concatenated from the bottom with the rest of the data segments in the initial solution to form the solution for the first step. At the next step, taking the solution produced by the previous step as the initial solution, the same process is repeated for the K data segments that are as close to the bottom as possible but subject to the constraint that at least one new data segment that was outside of the scope of the K data segments considered in the previous step is involved. This process is repeated until all data segments in the initial solution S^(1) have been included in at least one of the local interchange/merging steps.
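One way to realize the local interchange/merging steps is sketched below. It is an assumption-laden illustration, not the report's implementation: a rearrangement of the K segments in the window is modeled as a map onto consecutive ranks (equal ranks meaning a merge), segments below the window are renumbered when a merge shrinks it, read-only transaction types are assumed to be rooted in the lowest segment, and all function names are hypothetical.

```python
from itertools import product


def d(h, k, p):
    """Augmented weight d[h,k] from the HAO instance definition."""
    if h == k:
        return 0.0
    return 1.0 if h > k else p * (h - k)


def value(r, w, f, p):
    """Objective value with roots derived from f (highest written segment)."""
    lowest = max(f)
    total = 0.0
    for r_i, w_i in zip(r, w):
        g_i = min((f[j] for j, x in enumerate(w_i) if x != 0), default=lowest)
        total += sum((r_i[j] + w_i[j]) * d(g_i, f[j], p) for j in range(len(f)))
    return total


def local_step(r, w, f, p, Q, K):
    """Try every reordering/merging of segments Q..Q+K-1; keep the best."""
    best_f, best_v = list(f), value(r, w, f, p)
    for ranks in product(range(K), repeat=K):
        t = max(ranks) + 1
        if set(ranks) != set(range(t)):
            continue                      # ranks must be consecutive from 0
        shrink = K - t                    # segments lost through merging
        cand = []
        for level in f:
            if level < Q:
                cand.append(level)                    # above the window
            elif level < Q + K:
                cand.append(Q + ranks[level - Q])     # inside the window
            else:
                cand.append(level - shrink)           # below the window
        v = value(r, w, cand, p)
        if v > best_v:
            best_f, best_v = cand, v
    return best_f


def second_stage(r, w, f, p, K):
    """Slide the window from the bottom of the hierarchy to the top."""
    f, n, step = list(f), len(f), 1
    while True:
        q = max(f)                        # current number of data segments
        k = min(K, q)
        Q = min(n - k - step + 2, q - k + 1)
        if Q < 1:
            return f
        f = local_step(r, w, f, p, Q, k)
        step += 1


r = [[10, 0, 0], [4, 6, 0], [0, 0, 5]]
w = [[0, 0, 0], [0, 2, 0], [0, 0, 1]]
start = [1, 2, 3]
improved = second_stage(r, w, start, 0.5, K=2)
```

Because the unchanged arrangement is always among the candidates, each step returns a solution at least as good as the one it started from.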

The rationale behind this algorithm is the belief that it is very likely that the initial solution produced in the first stage is a rough cut of a good solution that local refinement of the sort described in the above algorithm can bring out. Referring to the tabular representation of the solution, each step in this second stage looks at a bundle of data segments and evaluates the result of different ways of rearranging these data segments. In the next step, the bundle to be considered is moved up by one data segment. Every time a bundle is examined, the area covered by that bundle is refined to make the lower-left region in the tabular representation even 'heavier', thereby bringing the solution closer to a desired form. An example of this process in its tabular form is shown in Figure 34.


The algorithm is formally described as follows: Let the solution produced by step i of the second stage be $S^{(2,i)}$, and let the number of data segments in $S^{(2,i)}$ be denoted as $q^{(i)}$. At the beginning of step i, given $S^{(2,i-1)}$, the task is to find a function $G^{(i)}: \{Q, Q+1, \ldots, Q+K-1\} \to \{Q, Q+1, \ldots, Q+K-1\}$, where $Q = \min\{\, n-K-i+2,\; q^{(i-1)}-K+1 \,\}$, such that $v(S^{(2,i)})$ is maximized, where $S^{(2,i)} = \{f^{(2,i)}, g^{(2,i)}\}$ and $f^{(2,i)}$ is defined as follows:

$$
f^{(2,i)}(j) = \begin{cases} f^{(2,i-1)}(j) & \text{for } j < Q \\ G^{(i)}(j) & \text{for } Q \le j \le Q+K-1 \\ f^{(2,i-1)}(j) - e & \text{for } j > Q+K-1, \end{cases}
$$

where $e = (Q+K-1) - \max\{\,\ldots\,\}$.


The complexity of the second stage algorithm depends on the choice of K, as the total number of evaluations to be performed in this stage is in the order of n * K! * (C(K,1) + C(K,2) + ... + C(K,K)), clearly non-polynomial in K. However, the larger K is, the higher is the probability that the final solution found at the end of the second stage is closer to the optimal solution. Therefore the choice of K represents a tradeoff between closeness to the optimal solution and the consideration for computing cost.
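To get a feel for this tradeoff, the order-of-magnitude count can be tabulated for a few values of K. The sketch below takes the count as n * K! * (C(K,1) + C(K,2) + ... + C(K,K)); the function name `second_stage_evaluations` and the sample values of n and K are assumptions for illustration.

```python
from math import comb, factorial


def second_stage_evaluations(n, K):
    """Order-of-magnitude number of evaluations in the second stage."""
    return n * factorial(K) * sum(comb(K, i) for i in range(1, K + 1))


# Growth of the count for a hierarchy of n = 10 data components.
counts = {K: second_stage_evaluations(10, K) for K in (2, 3, 5)}
```

Since C(K,1) + ... + C(K,K) equals 2^K - 1, the count grows faster than K!, so in this sketch moving from K = 3 to K = 5 already multiplies the work by nearly a factor of ninety.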
9.0  APPLICATIONS  OF  THE  HDD  APPROACH  TO  DATABASE  CONCURRENCY  CONTROL


This  chapter  is  devoted  to  application  studies.  The  purpose  of  these 
studies  is  to  illustrate  how  the  HDD  approach  can  be  effectively 
applied.  In  the  first  of  these  three  cases,  we  show  how  the  hierarchi¬ 
cal  algorithm  can  be  used  to  improve  performance  of  a  database  with  a 
heavily  congested  index  area.  The  case  is  an  application  of  the  hierar¬ 
chical  method  in  a  simple  cyclic  database  partition  composed  of 
essentially  two  data  segments.  In  the  second  case,  we  show  how  the  HDD 
approach  can  be  used  to  restructure  a  banking  application  from  its  cur¬ 
rent  segregated  state  of  operation  into  a  real-time  database  operation 
without  increasing  data  contention.  It  also  exemplifies  the  advantage 
of  using  the  HDD  approach  to  structure  databases.  In  the  third  case,  we 
show how database computers with large-scale multiprocessing activities can benefit from a design guided by the concept of hierarchical decomposition, and how the latter helps reduce the inter-processor interference due to database concurrency control.


9.1  CONCURRENCY  CONTROL  IN  AN  INDEXED  DATABASE 

In  this  section  we  present  an  application  of  the  hierarchical  decom¬ 
position  approach  to  concurrency  control  in  a  database  management  system 


that  uses  multi-level  indices  as  access  paths.  Dividing  the  database 
into  an  index  data  segment  and  a  data  record  segment,  we  obtain  a  simple 
decomposition  of  the  database  into  two  data  segments.  This  case  there¬ 
fore  illustrates  an  application  of  the  HTS  algorithm  to  a  simple 
two-data  segment  cyclic  data  partition.  The  motivation  behind  choosing 
this  arrangement  as  one  of  the  cases  for  demonstration  is  two-fold. 
First,  the  indexed  database  is  a  structure  that  is  popular  in  database 
management  systems.  Second,  this  case  is  easy  to  understand  and  is  sim¬ 
ple  in  terms  of  the  structure  of  the  data  segment  hierarchy  (consisting 
of  only  two  data  segments)  but  clearly  demonstrates  one  particular 
aspect  of  the  effectiveness  of  the  HTS  algorithm  -  how  it  can  be  used  to 
relieve  contention  in  a  high-traffic  area  of  the  database,  in  this  case, 
the index area.

The  structure  of  this  section  is  as  follows.  We  will  in  the  first 
subsection  review  the  basic  data  structures  and  operations  on  indices 
organized  as  B-trees.  It  will  then  be  shown  that  synchronizing  trans¬ 
actions  and  their  accesses  to  index  records  may  result  in  degradation  of 
performance  due  to  the  higher  traffic  volume  in  the  index  area. 
However, if only a portion of the transactions will cause an update in the index area, and these transactions can be declared before they start, we will in the second subsection show how the HTS algorithm, by assigning the index area to the higher-level data segment in the segment hierarchy, achieves the effect of giving preferential treatment to the index area and allows transactions that update the index area to proceed with little interference from transactions that do not have to update the index
area.  An  intuitive  example  will  be  given,  followed  by  a  quantitative 
analysis  via  a  simple  analytical  model  that  attempts  to  capture,  for 
comparison  purposes,  the  essence  of  the  contention  behavior  in  this  case 
under  two  different  concurrency  control  methods:  two  phase  locking  and 
the  HTS  algorithm. 

9.1.1  THE  B-TREE  INDEX  STRUCTURE  AND  RELATED  RESEARCH

The  B-tree  index  structure  is  a  multi-level  indexing  scheme  with 
dynamic  balancing.  As  shown  in  Figure  35,  in  a  multi-level  indexing 
scheme,  every  index  record  consists  of  a  number  of  elements  of  the  form 
<pointer-to-next-level-index-record, high-key>. At the leaf level of the multi-level index tree, the elements in the index record have the form of <pointer-to-data-record, key> and contain pointers to the data
records  themselves.  To  search  for  a  data  record  with  a  particular  key 
value  (or  a  set  of  records  within  a  certain  range  of  the  key  values) ,  a 
transaction  scans  the  highest  level  index  record,  identifies  the  element 
in  the  index  record  with  the  lowest  high-key  value  that  is  higher  than 
the  key  value  being  searched  for,  and  advances  to  the  next-level  index 
record  by  following  the  pointer  contained  in  that  element.  It  repeats 
the  process  until  the  desired  data  record  is  finally  located. 
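The search procedure can be sketched as follows. This is a simplified in-memory illustration, not a description of any particular database system: the class `IndexRecord`, the function `search` and the sample tree are assumptions, and a high-key is treated, as in the text, as strictly higher than every key reachable through its pointer.

```python
from bisect import bisect_right


class IndexRecord:
    """One index record: a sorted list of (pointer, key) elements.

    In a non-leaf record the pointer refers to a next-level IndexRecord and
    the key is a high-key; in a leaf record the pointer refers to a data record.
    """

    def __init__(self, elements, is_leaf):
        self.elements = elements
        self.is_leaf = is_leaf


def search(root, key):
    """Follow high-keys from the top-level record down to the data record."""
    record = root
    while not record.is_leaf:
        high_keys = [hk for _, hk in record.elements]
        # Position of the lowest high-key that is higher than the search key.
        pos = bisect_right(high_keys, key)
        if pos == len(high_keys):
            return None                   # key is beyond the indexed range
        record = record.elements[pos][0]
    for pointer, k in record.elements:
        if k == key:
            return pointer
    return None


leaf1 = IndexRecord([("rec10", 10), ("rec20", 20)], is_leaf=True)
leaf2 = IndexRecord([("rec30", 30), ("rec40", 40)], is_leaf=True)
root = IndexRecord([(leaf1, 30), (leaf2, 50)], is_leaf=False)
```

Every search starts at the top-level record, which is why the upper levels of the index see far more traffic than any individual data record.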

Multi-level indices could remain unchanged when data records are inserted, deleted and updated. This static scheme is suitable for databases that are not updated very frequently. However, if a database is updated frequently, then as the database evolves, an initially balanced multi-level index may become grossly unbalanced and cause serious performance degradation in record searching. Therefore under a static scheme the database is subject to periodic reorganization to produce new, balanced indices. An alternative scheme is to introduce dynamic balancing as the database evolves. This gives rise to using the B-tree scheme for organizing indices.

Dynamic  indexing  is  very  similar  to  the  static  scheme  in  its  data 
structures.  The  main  difference  lies  with  the  additional  activity  for 
dynamic  balancing  associated  with  each  insertion  and  deletion.  In  a 
B-tree  indexing  scheme,  the  number  of  pointer-key  elements  in  an  index 
record  is  constrained  by  a  parameter  k,  where 

k ≤ number of pointer-key elements per index record ≤ 2k-1    (1)

Inserting  a  data  record  into  the  database  will  cause  a  new  pointer-key 
element  to  be  added  to  the  appropriate  leaf  index  record.  However,  to 
enforce  (1),  adding  a  new  element  to  an  index  record  where  2k-l  elements 
are  already  in  the  record  will  cause  the  record  to  be  split  into  two. 
This  in  turn  will  cause  the  higher-level  index  record  to  be  updated, 
which  may  in  turn  cause  the  higher-level  index  record  to  be  split,  and 
the  effect  may  ripple  to  the  highest  level  index  record.  Conversely, 
deletion of a record will cause an element to be deleted from the appropriate leaf index record, which may in turn cause a merge of this index record with a neighboring index record to enforce (1), and so on.
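The split rule implied by constraint (1) can be illustrated on a single index record. The sketch below is a simplification under stated assumptions: elements are stored as (key, pointer) pairs so that they sort by key, the function name `insert_element` is hypothetical, and propagation of the split to the higher-level index record is left out.

```python
from bisect import insort


def insert_element(record, element, k):
    """Insert a (key, pointer) element into a sorted index record.

    Returns a list with one record, or with two records when the insertion
    would exceed the limit of 2k-1 elements and the record must be split.
    """
    updated = list(record)
    insort(updated, element)
    if len(updated) <= 2 * k - 1:
        return [updated]
    # 2k elements: split into two records of k elements each, so that both
    # halves still satisfy k <= size <= 2k-1.
    return [updated[:k], updated[k:]]


full = [(10, "a"), (20, "b"), (30, "c")]          # 2k-1 elements for k = 2
split = insert_element(full, (25, "d"), k=2)
roomy = insert_element([(10, "a"), (20, "b")], (15, "x"), k=2)
```

A split produces a new record that must be registered in the parent, which is how an insertion at the leaf level can ripple upward through the index.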

Dynamic balancing eliminates the kind of performance degradation a static scheme may suffer from as the database evolves and, more importantly, eliminates the need for periodic reorganization of the database. However, the prospect of updating the index records results in the need to synchronize all transactions' accesses to the index area, a requirement not needed in the static scheme, since in the static scheme the index area is read-only for all transactions. Since the index area is comparatively small and is to be accessed by a large number of transactions, it is an area of potential threat to the level of concurrency in the database. That is, two transactions that do not interfere at the data record level may still interfere at the index record level, which may cause serious degradation in the level of concurrency in the system.

Several locking-based approaches have been proposed to alleviate the problem. Two recent works of particular relevance are the structural locking approach <Kedem7S, Silberchatz80> and the B-link tree approach <Lehman82>. The first approach aims at reducing the amount of time the locks on the high-level index records must be held by each transaction. However, it still involves lock and unlock protocols for each access to the index area, and the scheme becomes fairly complicated when generalized to a multi-key environment or when a transaction accesses more than one data record (i.e., multi-step transactions).


The B-link tree approach is an interesting scheme which recognizes that the basic difficulty of dynamic balancing is the requirement to immediately propagate updates to the higher-level index records, i.e., an update to the data record and all the associated updates to the index records would have to be atomic. By slightly relaxing this requirement, which basically results in allowing the index structure to be temporarily in an 'imperfect' yet correct state, concurrency can be enhanced.


The scheme that we will describe, which employs the HTS algorithm, is complementary to the B-link tree scheme. The B-link tree scheme, while very specialized (i.e., applicable to B-tree operations only) and not generalized for handling multi-key indexes and multi-step transactions, does reveal a philosophy of optimization of concurrency control methods shared by the HDD approach: delay as much as possible the propagation of updates until a more convenient moment, while not compromising the goal of presenting the database to all transactions in a correct state. Therefore our scheme can be considered a scheme that is built on top of the general philosophy of the B-link tree scheme and further optimizes the synchronization procedures through a different channel.


9.1.2 APPLYING THE HTS ALGORITHM


We assume that the B-link tree approach is taken to eliminate the need for an update transaction to be immediately concerned about updating the index records higher than the leaf index records. This can be done by giving the leaf index records temporary fixes after a merge or a split such that, with the old higher level index record, the desired leaf index record could still be obtained. The exact nature of this scheme is fairly involved and is not directly related to the use of the HTS algorithm, and therefore will not be detailed here. However, we assume that the leaf index records and the data records are targets of atomic updates. That is, the effect of updating a data record that results in a leaf index record being updated would be made known to other transactions together with changes in the leaf index record. No partially updated state (i.e., data record updated but the leaf index record not updated, or vice versa) is to be seen by any other transaction. Under this arrangement, we are concerned therefore mainly with the data segment representing the leaf index area and that representing the data record area.


The relevant data segment graph (DSG) and the data segment hierarchy (DSH) in applying the HTS algorithm to the dynamic indexing problem are shown in Figure 36. As shown in the figure, the database is decomposed into two data segments. The higher level data segment (D1) is the index area, and the lower level data segment (D2) is the data area. With this database partition, two classes of transactions are identified:


(1) Transaction class 1 (T1): Transactions that involve updates to one of the indexed fields. These transactions have to write into both the index area and the data area. They are therefore transactions that are rooted in D1 (the higher level data segment) and form the class responsible for the D1 -> D2 arc in the data segment graph.

(2) Transaction class 2 (T2): Transactions that do not involve updates to any of the indexed fields. These transactions have to write into only the data record area, and need only to read from the index area. They are therefore transactions that are rooted in D2 (the only data segment they write into) and form the class responsible for the D2 -> D1 arc in the data segment graph.

Obeying the HTS protocols, accesses to data granules from a transaction will be synchronized by the rules spelled out as follows:

(1) Read(d,t) where d is in D1 and t is in T1: Grant access to the most recent version of d before I(t). Timestamp the accessed version of d with a read timestamp with value I(t).

(2) Write(d,t) where d is in D1 and t is in T1: Check if the most recent version of d has been stamped by any time greater than I(t). If so, abort t. Otherwise, create a new version of d with version timestamp I(t).

(3) Read(d,t) where d is in D2 and t is in T1: Grant access to the most recent version of d before I(t) + q, where q is a constant predetermined by the system for all transactions in class T1, such that the probability that a transaction in class T1 would last longer than q is very small. Timestamp the accessed data version with a read timestamp of value I(t) + q.

(4) Write(d,t) where d is in D2 and t is in T1: Check if the most recent version of d has been stamped by any time greater than I(t) + q. If so, abort t. Otherwise, create a new version of d with a version timestamp of I(t) + q.

(5) Read(d,t) where d is in D1 and t is in T2: Grant access to the most recent version of d before A21(I(t)), i.e., the initiation time of the oldest active transaction in transaction class T1 at time I(t).

(6) Read(d,t) where d is in D2 and t is in T2: Grant access to the most recent version of d before I(t). Timestamp the accessed data version with a read timestamp of value I(t).

(7) Write(d,t) where d is in D2 and t is in T2: Check if the most recent version of d has been stamped by any time greater than I(t). If so, abort t. Otherwise, create a new version of d with a version timestamp of I(t).

The above seven categories of accesses are summarized in the following table, where each access is given the protocol name used (i.e., Protocol E, H or L).

9.1.2.1 AN EXAMPLE




As an example, consider the following database and a set of transactions to be executed concurrently:

Data record: <Social Security Number (SSN), income level>.

Keyed field: SSN

t1: Insert record <7048, 10K>.

t2: Update income level of the record with SSN = 6337.

t3: Update income level of the record with SSN = 7059.

A relevant snapshot of the database before these three transactions are run is shown in Figure 38. Let the most recent versions in the snapshot of all records involved in the three transactions be before I(t1), and let I(t1) < I(t2) < I(t3). Now we show a sequence of interleaved execution of these transactions under control of the HTS algorithm:

t1: access record A and leave a read timestamp I(t1) with A.

t2: access record A.

t3: access record A.

t1: after creating a new record where SSN = 7048 with a version timestamp = I(t1) + q, create a new version of the record A with a version timestamp = I(t1).

t2: create a new version of the record B with version timestamp = I(t2).

t3: create a new version of the record C with version timestamp = I(t3).

All these transactions in our example conflict on the index record A but do not conflict at the data level. However, with the HTS algorithm, all three transactions are processed concurrently without inducing any delay or restart. Moreover, the accesses of t2 and t3 to the record A do not entail any effort to timestamp the accessed record. The same sequence of actions, if controlled by the conventional MVTS, will result in t2 and t3 leaving read timestamps with record A and also in t1 being aborted and restarted. If controlled by the two phase locking algorithm, it will result in t2 and t3 being blocked until t1 is finished, in addition to having every record access bracketed in the overhead of locking and unlocking.
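
The seven access rules and the interleaving above can be replayed with a small sketch. It is only an illustration under stated assumptions: the class and function names (`Granule`, `Version`, `hts_read`, `hts_write`, `replay`) are hypothetical, the numeric initiation times and the value of q are arbitrary, writes are checked immediately rather than in a separate validation phase, and A21(I(t)) is passed in by the caller rather than computed from a list of active transactions.

```python
from dataclasses import dataclass


class Abort(Exception):
    """Raised when a write is invalidated and the transaction must restart."""


@dataclass
class Version:
    ts: float                        # version (write) timestamp
    read_ts: float = float("-inf")   # largest read timestamp left on this version


class Granule:
    def __init__(self):
        # One initial version that is older than every transaction in the example.
        self.versions = [Version(ts=0.0)]

    def read(self, ts, stamp=True):
        """Grant the most recent version before `ts`; optionally leave a read timestamp."""
        v = max((v for v in self.versions if v.ts < ts), key=lambda v: v.ts)
        if stamp:
            v.read_ts = max(v.read_ts, ts)
        return v

    def write(self, ts):
        """Create a version at `ts` unless the most recent version is stamped later."""
        last = max(self.versions, key=lambda v: v.ts)
        if max(last.ts, last.read_ts) > ts:
            raise Abort(f"write at {ts} invalidated")
        self.versions.append(Version(ts=ts))


def hts_read(granule, segment, cls, init, q, oldest_active_t1=None):
    if cls == 1:                                   # rules (1) and (3)
        return granule.read(init if segment == 1 else init + q)
    if segment == 1:                               # rule (5): no timestamp is left
        return granule.read(oldest_active_t1, stamp=False)
    return granule.read(init)                      # rule (6)


def hts_write(granule, segment, cls, init, q):
    if cls == 2 and segment == 1:
        raise ValueError("class T2 never writes into D1")
    # rule (4) uses I(t) + q; rules (2) and (7) use I(t)
    granule.write(init + q if (cls == 1 and segment == 2) else init)


def replay(use_hts, q=10.0):
    """Replay the t1/t2/t3 interleaving; return index record A."""
    init = {"t1": 1.0, "t2": 2.0, "t3": 3.0}       # I(t1) < I(t2) < I(t3)
    a, b, c, new_rec = Granule(), Granule(), Granule(), Granule()
    if use_hts:
        hts_read(a, 1, 1, init["t1"], q)
        hts_read(a, 1, 2, init["t2"], q, oldest_active_t1=init["t1"])
        hts_read(a, 1, 2, init["t3"], q, oldest_active_t1=init["t1"])
        hts_write(new_rec, 2, 1, init["t1"], q)    # stands in for the record with SSN 7048
        hts_write(a, 1, 1, init["t1"], q)
        hts_write(b, 2, 2, init["t2"], q)
        hts_write(c, 2, 2, init["t3"], q)
    else:
        # Conventional multiversion timestamping: every read uses I(t) and stamps A.
        for name in ("t1", "t2", "t3"):
            a.read(init[name])
        a.write(init["t1"])                        # invalidated by the later read stamps
    return a


if __name__ == "__main__":
    print(replay(use_hts=True).versions)
    try:
        replay(use_hts=False)
    except Abort as exc:
        print("conventional timestamping aborts t1:", exc)
```

In the HTS branch, t2 and t3 read the version of A that precedes I(t1) and leave no read timestamp, so the later write by t1 finds nothing stamped after I(t1). In the conventional branch the read timestamps of t2 and t3 on A are what invalidate the write by t1.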


An intuitive explanation for why the above scheme works is that the HTS algorithm has the net effect of giving preferential treatment to the index area. As shown in our example, instead of allowing t2 and t3 to directly conflict with t1, which belongs to the class T1, it makes t2 and t3 appear to t1 as earlier transactions, even though t2 and t3 actually started after t1. Transactions t2 and t3 therefore are simply granted 'earlier' versions and will not leave traces that would have caused t1 to abort. We will in the next subsection present a more analytical explanation of the forces underlying our simple example.


9.1.3  AN  ANALYSIS  USING  A  SIMPLE  ANALYTICAL  MODEL 


We now present an analysis to quantify the effect of the HTS algorithm on performance improvement, in terms of level of concurrency, in our dynamic indexing case. The analysis is based on a simple analytical model of the behavior of transactions synchronized by the two phase locking algorithm and the timestamping algorithm.

9.1.3.1 MODELLING APPROACH

Due to the complexity of the activities involved in a transaction processing system under different concurrency control methods, the present analysis employs a number of assumptions to simplify the problem. The alternative method of study would have been simulation, which would have been both costly and relatively inefficient in providing intuitive pictures. Therefore, for the purpose of obtaining mean values of performance measures, the present analytical approximation approach is chosen. The emphasis of comparison is on concurrency control related overhead, the relevant measures being frequencies of blocking and restarts due to data contention.

The classic computer system performance evaluation studies normally concentrate on the impact of contention for computer system resources, such as CPU and I/O devices, on the performance of the system in terms of response time and throughput. Relatively little attention has been paid to evaluating the impact of data contention on the performance of a database system. To some extent, while the notions of a system being CPU-bound or I/O-bound are well established in the field of computer system performance evaluation, the notion of data-bound, or concurrency-bound, is yet to be established. It is precisely this dimension of performance limitation that the concurrency control algorithms are to address.

To concentrate on evaluating the impact of data contention rather than contention for other system resources, we propose to conceal the details of the computer systems (CPU, number of I/O devices, etc.) by parameterizing the mean computer service time for each granule request to be a constant 1/u. This implies that the computer system resources are adequately supplied such that marginal increases in load (i.e., size of transactions or number of terminals) will not cause significant contention for computer resources to occur, thereby maintaining the computer service time per granule at 1/u. Equivalently, we postulate a transaction processing system in which the entire computer system is abstracted into a single 'infinite' server with a constant mean service time per granule access request. We then derive expressions for performance measures for this transaction processing system as functions of parameters of concern to concurrency control, such as transaction size, level of multi-programming, size of read sets and write sets, database size, etc., under each of the concurrency control algorithms to be studied. In short, this approach helps isolate the effect of data contention from that of computer system resource contention and is especially useful in comparative evaluation of different concurrency control methods.


As a short digression, this philosophy of separating data contention from computer system resource contention is not entirely new. Some early work in performance evaluation of concurrency control using simulation (<Spitzer76>, <Muntz77>, <Ries77>, <Ries79>) did not make this distinction. However, the set of simulation studies conducted in (<Lin82a>, <Lin82b>, <Lin82c>) specifically separates the effect of contention for computer systems from that of data contention. In particular, in <Lin82b>, a simulation of the two phase locking concurrency control method is conducted which provides interesting evidence that the performance of the two phase locking method depends very little on the variance of the granule service time. This gives some support for the validity of studying data contention as an entity separate from contention for computer systems. Among the analytical work, <Potier80> studied the effects of the two phase locking method on performance, employing a two-level model in which the computer system resource contention is modeled at one level while the data contention and concurrency control are modeled at the other level, and the global performance is derived by combining the two. A similar approach was taken in <Menasce82> for analytically evaluating the performance of the timestamping algorithm. Analytical models that concentrate on data contention are reported in <Shum81>, <Chesnais83>, <Gray81> and <Goodman83>. Our purpose in modelling here, however, is to provide simple formulas that are intuitively useful in assessing the impact of the concurrency control methods on the performance of a segmented database. Therefore we rely on deriving functional relationships among system parameters and performance measures at the steady state to shed light on the key differences of the concurrency control methods under study.

Under this approach, we postulate a model of transaction behavior under two phase locking as shown in Figure 39(a) and a model of that under multiversion timestamping in Figure 39(b). In general, a transaction begins by requesting the first granule. Under two phase locking, the request may be blocked and the transaction put in the block queue until it is reactivated. Under multiversion timestamping, however, the request is always granted. Once the request is granted, the transaction spends on the average 1/u units of time in accessing and processing that granule. This time includes all CPU and I/O time needed for processing the granule and waiting time for CPU and I/O devices. The transaction then proceeds to request the second granule, and repeats this process until it is finished. When a transaction has finished processing all its granules, it enters its commit phase, whose length is determined by the number of granules the transaction has to write. Under 2PL, the transaction then leaves the system as successfully completed. Under MVTS or HTS, however, a transaction would have to be atomically validated for successful completion. This means that a transaction could be aborted and leave the system as unsuccessfully finished.

We assume that the transactions in the same class behave in a homogeneous way, meaning that all transactions in a class Ti access the same number of data granules in each data segment. We denote the number of granules of data segment Dj accessed (or written) by a transaction in class Ti as $a_{i,j}$ (or $w_{i,j}$). The size of a transaction is equivalent to the total number of granules it accesses. We denote the size of transactions in class Ti as $a_i$. We also assume that all granules within the same data segment have the same likelihood of being accessed in an access request directed to that data segment.

In modeling the behavior of the 2PL algorithm, we ignore the deadlock issue to render the model tractable, assuming that its impact on the overall performance is negligible, and we are mainly concentrating on the algorithm's blocking behavior. This assumption is very likely to be violated in real systems. However, this assumption is in favor of the 2PL approach, and is a conservative assumption when the purpose of the study is to identify scenarios under which HTS can perform better than 2PL.

Finally, we assume that at steady state, the expected number of granules already processed by a transaction in the system is a fixed proportion s of the size of the transaction. (The size of a transaction is given by the parameter $a_i$ for transactions in class Ti.) That is, if one were to pick at random a transaction in class Ti currently in a system at its steady state, then the expected number of granules already accessed by this transaction can be expressed as $s \cdot a_i$, where s is a constant between 0 and 1. This assumption plays an important role in simplifying the computation of transition probabilities in both the 2PL system and the HTS system.

Given the above, Figure 40 provides a summary listing of the parameters that we will be working with.

9.1.3.2 DERIVATION OF PERFORMANCE MEASURES

TWO PHASE LOCKING MODEL: We will first compute the rate of conflicts in the transaction system under the 2PL algorithm. To do this, one must derive the expected number of granules locked and the expected number write-locked at any time in both D1 and D2. Let these numbers be denoted as $g_1$, $g_2$, $gw_1$, $gw_2$, where $g_1$ denotes the expected number of granules locked in D1 and $gw_1$ denotes the expected number of granules write-locked in D1, and so on. These numbers can be approximated from the number of transactions in the system and the parameter s, the expected proportion of the number of granules processed by a transaction in the system:

$$g_1 = (MP_1 \cdot a_{1,1} + MP_2 \cdot a_{2,1}) \cdot s$$

$$gw_1 = (MP_1 \cdot w_{1,1} + MP_2 \cdot w_{2,1}) \cdot s$$

Since we have assumed that data granules within a data segment have the same likelihood of being accessed in any request, the probability that a read request from a transaction in class T1 for a granule in D1 is rejected can be given as:

$$pread(1,1) = (gw_1 - w_{1,1} \cdot s) / |D_1| = (s / |D_1|) \cdot ((MP_1 - 1) \cdot w_{1,1} + MP_2 \cdot w_{2,1})$$

Similarly, the probability that a write request from a transaction in T1 for a granule in D1 is rejected is given as:

$$pwrite(1,1) = (g_1 - a_{1,1} \cdot s) / |D_1|$$

Now, to compute the rate of blocking, one must also find an expression for the rate of read and the rate of write requests from each transaction class to each of the data segments. These rates are functions of the throughput of the system. Let the expected response times of transactions in classes T1 and T2 be denoted as $R_1$ and $R_2$. The throughput of the transactions in class T1, denoted as $throughput_1$, according to Little's law, is given as

$$throughput_1 = MP_1 / R_1$$

The rate of read requests from transaction class Tj to data segment Di is therefore given as

$$rrate(j,i) = throughput_j \cdot a_{j,i} / a_j$$

Similarly one can derive the rate of write requests wrate(j,i). Multiplying the rate of requests by the probability of conflict per request, one derives the rate of conflict, or the rate of blocking, in the system as follows:

Rate of blocking due to accesses to Di =

$$\sum_j \big( pread(j,i) \cdot rrate(j,i) + pwrite(j,i) \cdot wrate(j,i) \big)$$

When $|D_1| \ll |D_2|$, which amounts to saying that the potential contention in D1 would be much higher than that in D2, we want to concentrate on the rate of blocking due to accesses to D1, which is expected to dominate that due to accesses to D2. Since no transaction in T2 writes in D1, we have $w_{2,1} = 0$. In addition, in our case, $a_1 = a_2$ and $a_{1,1} = a_{2,1}$. Therefore, we give the rate of blocking in D1 in our case as follows:

$$C \cdot (1/R_1) \cdot \{\, 2 \cdot MP_1 \cdot (MP_1 - 1) + (1 + e) \cdot MP_1 \cdot MP_2 \,\} \qquad \text{(2PL-1)}$$

where e > 1 is the ratio between $R_1$ and $R_2$, i.e., $e = R_1 / R_2$, and

$$C = (s \cdot a_{1,1} \cdot w_{1,1}) / (|D_1| \cdot a_1)$$
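
The step from the general sum over transaction classes to (2PL-1) can be written out explicitly. The following is a sketch under stated assumptions: pread(2,1) is formed by analogy with pread(1,1), wrate(j,i) is taken to be $throughput_j \cdot w_{j,i} / a_j$ by analogy with rrate(j,i), and the per-request probabilities are used exactly as defined above, so that each already carries the factor $s/|D_1|$.

```latex
\begin{align*}
pread(1,1)  &= \tfrac{s}{|D_1|}\,(MP_1 - 1)\,w_{1,1}, &
pread(2,1)  &= \tfrac{s}{|D_1|}\,MP_1\,w_{1,1}, \\
pwrite(1,1) &= \tfrac{s}{|D_1|}\,(MP_1 - 1 + MP_2)\,a_{1,1}, &
wrate(2,1)  &= 0, \\
rrate(1,1)  &= \tfrac{MP_1}{R_1}\,\tfrac{a_{1,1}}{a_1}, &
rrate(2,1)  &= \tfrac{MP_2}{R_2}\,\tfrac{a_{2,1}}{a_2}
             = \tfrac{e\,MP_2}{R_1}\,\tfrac{a_{1,1}}{a_1}, \\
wrate(1,1)  &= \tfrac{MP_1}{R_1}\,\tfrac{w_{1,1}}{a_1}.
\end{align*}
% Summing pread * rrate + pwrite * wrate over both classes:
\begin{align*}
\text{rate of blocking in } D_1
 &= \frac{s\,a_{1,1}\,w_{1,1}}{|D_1|\,a_1}\cdot\frac{1}{R_1}
    \Big[ MP_1(MP_1 - 1) + e\,MP_1 MP_2 + MP_1(MP_1 - 1 + MP_2) \Big] \\
 &= C \cdot \frac{1}{R_1}\,
    \Big\{ 2\,MP_1(MP_1 - 1) + (1 + e)\,MP_1 MP_2 \Big\}.
\end{align*}
```

The three bracketed terms come, in order, from reads by T1, reads by T2, and writes by T1; the second and third are the ones that bring MP2 into the result.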


What is interesting about this expression for approximating the rate of blocking due to contention in D1 is that it is sensitive to the number of transactions in both classes in the system. Therefore when D1 is heavily contended for and MP2 is much greater than MP1, we expect the system to suffer from a high rate of blocking, much of which would be spent in blocking transactions in class T2, those transactions that need only to read D1 and do not even update D1.

Now we turn to deriving the rate of restarts under the Hierarchical Timestamp Algorithm.


HIERARCHICAL TIMESTAMP MODEL: The major performance characteristic of the HTS algorithm is the rate of abort, i.e., the rate of restarting transactions due to invalidated accesses. We again assume that in the case under study D1 is much more contended than D2, and therefore the rate of aborting transactions due to invalidated accesses to granules in D1 dominates that in D2. We are, as a result, interested in deriving the rate of aborts due to contention in D1.

First we derive, for a particular transaction t in class T1, the expected number of data granules in D1 that are timestamped by transactions that are started later than t when t enters its atomic validation phase. This number, denoted as $d_1(1)$, divided by the total number of data granules in D1, is the probability that a write request from t to a data granule in D1 would be invalidated during the validation phase. Therefore, the probability that a transaction in class T1 would be invalidated due to requesting to write in D1, and as a result would be aborted, is given as

$$P_1(1) = 1 - (1 - d_1(1) / |D_1|)^{w_{1,1}} \qquad (1)$$

Since in our case D1 is higher than D2 in the data segment hierarchy, $w_{2,1} = 0$. Therefore, under HTS, only transactions in class T1 could be aborted due to invalidated write requests for granules in D1, while transactions in T2 will never be aborted due to contention in D1.

Now we proceed to derive $d_1(1)$, the expected number of data granules timestamped by times later than I(t) when transaction t in T1 enters its atomic validation phase. To compute this, we first derive the expected elapsed time between the initiation time of t in T1 and the time when t enters its atomic validation phase. This time, however, is simply the total granule processing time of the transaction, $a_1/u$, since under the HTS algorithm no time of a transaction is spent in the block state.

Next we derive the expected number of transactions in class T1 that are started during this period of time. This is given as $(a_1/u) \cdot (MP_1 - 1) / ((a_1 + w_1)/u)$, since $MP_1 / ((a_1 + w_1)/u)$ gives the rate of starting transactions in T1. Finally, we derive $d_1(1)$, the expected number of granules in D1 timestamped by later transactions when t in T1 enters its validation phase. Since, under HTS, no transactions in class T2 would timestamp data elements in D1 (recall that all transactions in T2 use protocol H to access D1, which does not require any timestamping), $d_1(1)$ will depend solely on the rate of starting transactions in T1 and the expected stage at which these transactions started after t would be in when t enters its validation phase. The latter is determined by the parameter s. We therefore derive an approximation of $d_1(1)$ as follows:

$$d_1(1) \approx (a_1/u) \cdot \frac{MP_1 - 1}{(a_1 + w_1)/u} \cdot s \cdot a_{1,1} \qquad (2)$$

This formula in fact gives an upper bound for $d_1(1)$ since it makes the assumption that there is no overlap of granules timestamped by the younger transactions. Therefore we must keep in mind that the resulting measures of rate of abort would tend to be overstated.

Plugging (2) into (1), we derive the probability that a transaction in T1 would be aborted due to invalidated write requests for granules in D1:

$$P_1(1) = 1 - \left\{ 1 - \frac{(MP_1 - 1) \cdot s \cdot a_{1,1} \cdot a_1}{|D_1| \cdot (a_1 + w_1)} \right\}^{w_{1,1}}$$

The rate of aborting transactions due to contended accesses to D1 is therefore $P_1(1)$ multiplied by the rate of starting transactions in class T1, and can be approximated by the following upper bound:

$$C \cdot \left( \frac{a_1}{a_1 + w_1} \right)^2 \cdot u \cdot MP_1 \cdot (MP_1 - 1) \qquad \text{(HTS-1)}$$

where C is defined the same way as in (2PL-1).
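
The upper bound can be obtained from the abort probability in one step. The sketch below assumes that Bernoulli's inequality is the intended approximation (the text does not name it), and it introduces the shorthand x for the per-write invalidation probability; it requires $0 \le x \le 1$ and $w_{1,1} \ge 1$.

```latex
\begin{align*}
x &= \frac{(MP_1 - 1)\, s\, a_{1,1}\, a_1}{|D_1|\,(a_1 + w_1)}, \qquad
P_1(1) = 1 - (1 - x)^{w_{1,1}} \;\le\; w_{1,1}\, x
  \quad \text{(Bernoulli's inequality)}, \\[4pt]
\text{rate of abort}
 &= P_1(1) \cdot \frac{MP_1}{(a_1 + w_1)/u}
 \;\le\; w_{1,1}\, x \cdot \frac{MP_1\, u}{a_1 + w_1} \\
 &= \frac{s\, a_{1,1}\, w_{1,1}}{|D_1|\, a_1}
    \cdot \left( \frac{a_1}{a_1 + w_1} \right)^{2}
    \cdot u \cdot MP_1\,(MP_1 - 1)
 \;=\; C \cdot \left( \frac{a_1}{a_1 + w_1} \right)^{2} \cdot u \cdot MP_1\,(MP_1 - 1).
\end{align*}
```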

Comparing (HTS-1) with (2PL-1), one discovers the difference that (HTS-1) does not depend on the level of multi-programming or the throughput of the transactions in class T2. Therefore, if the rate of blocking due to contention in D1 is the source of performance degradation under 2PL and the rate of restarting is that under HTS, HTS clearly has the potential of offering an advantage over 2PL, especially in cases where the throughput requirement for transactions in class T2 is much higher than that in T1.

CONCLUSION: In sum, we have adopted a very simple analytical model to capture the essence of the performance of the 2-data segment case under the two phase locking algorithm and the hierarchical timestamp algorithm. The model makes assumptions that are in general in favor of the 2PL approach and discriminate against the HTS approach. The purpose is to demonstrate the effect of the HTS algorithm in relieving contention in the higher data segment, presumably the much more heavily contended data segment in our case. The result shows that, in general, the rate of blocking under 2PL is proportional to $MP_1^2 + MP_1 \cdot MP_2$. In contrast, the rate of abort under HTS is proportional only to $MP_1^2$. While the absolute values depend on other parameters, one can readily conclude that HTS is preferred when $MP_2 \gg MP_1$.
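
The two closed forms can be evaluated side by side to see the dependence on MP2. This is a minimal sketch: the function names are hypothetical, and every numeric parameter value in the usage section is an arbitrary illustration rather than a measurement from the study. In particular R1 is supplied by hand, whereas in a real system it would itself depend on the blocking rate.

```python
def contention_constant(s, a11, w11, d1_size, a1):
    """C = (s * a11 * w11) / (|D1| * a1), shared by (2PL-1) and (HTS-1)."""
    return (s * a11 * w11) / (d1_size * a1)


def blocking_rate_2pl(c, r1, e, mp1, mp2):
    """(2PL-1): rate of blocking due to accesses to D1 under two phase locking."""
    return c * (1.0 / r1) * (2 * mp1 * (mp1 - 1) + (1 + e) * mp1 * mp2)


def abort_rate_hts(c, a1, w1, u, mp1):
    """(HTS-1): upper bound on the rate of aborts due to contention in D1."""
    return c * (a1 / (a1 + w1)) ** 2 * u * mp1 * (mp1 - 1)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the formulas.
    c = contention_constant(s=0.5, a11=2, w11=1, d1_size=1000, a1=10)
    hts = abort_rate_hts(c, a1=10, w1=2, u=100.0, mp1=5)
    for mp2 in (5, 50, 500):
        two_pl = blocking_rate_2pl(c, r1=0.2, e=1.5, mp1=5, mp2=mp2)
        print(f"MP2={mp2:4d}  2PL blocking rate={two_pl:.4f}  HTS abort bound={hts:.4f}")
```

The HTS bound takes no MP2 argument at all, which is the point of the comparison: only the 2PL expression grows as the number of class T2 transactions increases.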

9.2 DESIGN OF A BANKING DATABASE

In this section we apply the hierarchical database decomposition approach to the design of a banking application. As discussed before, the hierarchical approach to structuring the database components offers advantages additional to reducing data contention and synchronization overhead. It also encourages transaction and database designs that take advantage of the pipelining operations that may be intrinsic to an application, and results in designs that are more efficient, better structured and more resistant to localized failures. While the main purpose of the current case study is to demonstrate the existence of applications that can be structured hierarchically and therefore are candidate applications where the hierarchical timestamping algorithm can be employed to reduce contention and synchronization overhead, this case is also chosen so as to exemplify the structuring effect of the HDD approach.

A synopsis of this section is as follows. In the first part we provide a description of the current operation of the banking system under study. The current system, considered one of the most advanced of its kind and quite sophisticated and complex, is under review by the bank to find ways to improve it so as to better meet the demands placed on it as one of the leading electronic banking systems. The current system is composed of a number of independent systems that have been developed separately for different bank product offerings, such as loans, funds transfer, cash management, etc. Because the current system is not intrinsically an integrated system, it is considered at a disadvantage in dealing with integrated customer services, new product offerings, and growth in demand for services.

To  provide  a  basis  for  integration,  a  new  architectural  design  of  the 
system  has  been  proposed.  This  new  design  fully  exploits  the  database 
system  technology  in  order  to  separate  processing  from  data,  so  that  the 
database  could  serve  as  an  integrated  source  of  information  for  all 
types  of  existing  and  future  processing.  This  conceptual  model  of  the 
new system will be briefly described in the second part of this section.

Taking the conceptual model as the future system architecture, this
banking system, in its integrated form, i.e., as a database application,
appears to exhibit a hierarchical structure to which the HDD approach to
database concurrency control can be applied. This hierarchical structure
can be derived from the types of operations and transactions supported
by the current system. In other words, if we were to pull together all
the data now residing, and perhaps duplicated, in the various systems
and analyze the access and update behavior of the transactions on these
data, we could readily identify a hierarchical structure in the database
in which segments of a higher level are read, but need not be updated,
by transactions that update segments of a lower level. This
hierarchical structure of the database and its corresponding transaction
types and classes constitute a proposed high-level database design for
the new banking system, and will be described in the third part of this
section. This design has the salient feature of achieving the effect of
data integration without concerns over increased data contention and
synchronization overhead. The section concludes, in its final part,
with the advantages of the proposed design.

9.2.1  CURRENT  SYSTEM 

This description of the current system is mostly extracted from
<Document-A> and from site visits and verbal follow-ups. The purpose of
this description is to provide an overview of the structure of the
current configuration and the types of processing in the current system,
and to use it as a basis for designing an integrated system in which
information resources are concentrated in a well-structured database.

The current system is composed of various systems that support
communication, financial processing (FP) and information processing (IP).
Communication  is  performed  by  a  number  of  communication  networks 
internal  or  external  to  the  bank  entity  under  study.  The  purpose  of 
these  communication  networks  is  to  route  messages,  requests  and  activity 
information  among  the  financial  processing  systems,  the  information 
processing  systems  and  the  external  world.  Since  our  focus  of  study  is 
the  information  resources  and  processing,  we  will  not  be  concerned  with 
the  detailed  nature  of  the  communication  networks. 

Financial processing refers to the processing of customers' requests
for services, namely, transactions. It is itself supported by several
different systems, each of which supports a particular type of financial
service provided by the bank. Currently, there are six systems in this
group:

(1)  The  Funds  Transfer  System 

(2)  The  Loan  System 

(3)  The  Trade  Service  System  (mostly  for  processing  letters  of 
credit) 

(4)  The  Cash  Letter  System 

(5)  The  Cash  Management  System 

(6)  The  Customer  Inquiry  and  Investigation  System 

Among these systems, the funds transfer system provides a basic
service that most other systems would use, i.e., transaction processing
in other systems may result in funds transfer requests being sent to the
Funds Transfer System.

All the FP systems also produce posting requests, which are handed over
to information processing (IP) at the end of an accounting cycle; IP
uses this posting information to compile account status. At the
beginning of an accounting cycle, the FP systems also receive updated
account data from IP to be used as the basis for transaction processing
during that accounting cycle. The logical relationship among these
financial processing systems and IP is shown in Figure 41.


The various financial processing systems are currently implemented as
separate entities. That is, every system is a 'sealed' system,
communicating with the world external to the system mainly via its own
terminals. Data files used by a system are tightly coupled with the
processing component of that system. No data files are physically
shared across systems. The current physical relationship among
these financial processing systems and IP is shown in Figure 42.

The physical configuration of the current system uses the Service
Management Center (SMC) approach to remedy the problems arising from a
lack of direct communication among the systems. In an SMC, terminals
that are connected to various systems are pulled together in one site to
form a uniform interface to customer requests. Customer inquiries
and requests come to this site through phone lines or other
communication networks and are interpreted and processed by an operator,
who in turn routes each request to the appropriate terminal. Use of the
service management centers is a first step towards the realization of an
integrated customer service.

Information processing (IP) refers to processing of customer and bank
account information for accounting and MIS purposes. It consists of
journal, customer ledger and bank ledger processing; proofing, a process
that checks for consistency among journal entries and ledgers and if
necessary generates repair postings for corrections; MIS processing,
such as account profitability data compilation; and risk control
information processing, which computes the bank's risk exposure to
various levels of aggregated customer entities. IP accepts posting and
transaction information from the financial processing systems. The
customer account status information obtained in IP is then fed back to
the financial processing systems. The IP activities are shown
diagrammatically in Figure 43.

Currently, IP is basically a batch processing system which accepts
postings from and feeds account status information to various other
systems via tape handoffs or hardcopies. IP is implemented on computer
systems physically separate from the financial processing systems, and
can also be considered a 'sealed' system.

9.2.2 MOTIVATION FOR AN INTEGRATED SYSTEM


The configuration of the current system is a result of both technical
considerations (e.g., see <Matteis79>, <Reef79>) and organizational
considerations. Segmenting systems by product line makes it easier for
individual organizational units to plan, develop and control the system
that concerns the product line for which each unit is responsible.
However, changing needs have made it essential that segmented systems
be strategically integrated. The SMC facility currently employed by the
bank is a way to emulate an integration of systems that are currently
separate.


In <Madnick83>, several scenarios for the need to integrate
historically segregated systems were discussed. Drawing on the
discussions presented in that paper and the situations faced by the bank
in our case study, we summarize the motivation behind integration as
follows:

(1)  Separate  systems  make  it  difficult  to  provide  a  consistent 
interface  to  a  customer  who  purchases  various  financial  products 
provided  by  the  bank.  For  example,  the  status  of  a  customer 
across  all  financial  products  is  not  readily  available,  since  it 
has  to  be  separately  pulled  out  from  various  systems.  An 
inquiry into a customer transaction that spans product lines cannot
be serviced efficiently.

(2) Dispersed information makes it awkward to introduce new,
multi-service products quickly. For example, the cash management
service now provided by the bank, which enables customers to query
their account status directly from their terminals, cannot access
the current account balances in the funds transfer system.
Therefore the account balances made available to the customers are
compiled from the previous accounting cycle. Making information
now residing in other financial processing and information
processing systems available to customers at their terminals would
also prove to be time-consuming.

(3) In general, the various manual interventions now required to
periodically obtain information from one system for processing in
another are costly and may prove to be a limiting factor for
potential future growth.

(4) There are potential problems associated with squeezing the
current volume of information processing work into a batch
processing window, a window whose size depends on how long the
financial systems have to be kept on-line. In the current system,
there are only occasional misses since the batch window is still
relatively large. However, one must begin to plan for the shrinking
of this window as users demand it, and for an increase in system
load (e.g., volume of transactions). Efforts that move information
processing such as ledger updating on-line, so that it can be
performed concurrently with financial transactions, will
drastically reduce the amount of work that must be squeezed into
the batch window. This amounts to the need for IP to directly
access transaction information and posting requests generated by
FP on a real-time basis.

(5) For the purpose of risk control, events that are taking place in
another system within the current accounting cycle may also be
important. However, this information is not available for decision
making when daily accounting processing is the only way for
a system to become aware of events captured by another system.

In sum, an integrated system offers important strategic advantages
over the currently segregated configuration. As a result, an
architectural design of an integrated system has been proposed
<Document-C>. In essence, the proposal suggests that the new system be
designed such that information resources are separated from processing,
i.e., transaction and information processing of all types is to be
implemented on top of the shared information resources of the bank
entity. Figure 44 captures the spirit of this proposed architecture.

9.2.3 DESIGNING THE STRUCTURE OF THE INTEGRATED INFORMATION RESOURCE


The above paragraphs provide a background against which a study of
the structure of the integrated information resources in our case is
attempted. The goal of this study is two-fold. First, a structure of
the information resource, combined with the HDD approach to database
concurrency control, can help integrate data now dispersed in separate
systems without increasing data contention. Second, this structure
may be used as an architecture for guiding future detailed database
design activity.

The study is guided by the simple idea that database design is
inevitably interlocked with transaction design, i.e., DB design depends
on how data is accessed, used and updated by transactions. In serving
real-time transaction processing, every effort should be made to make
transactions as short as possible so that resources can be freed up
quickly and response times optimized. This amounts to deferring as much
as possible any processing that need not be performed in real time.
Much of a database often contains derived information of one level or
another. Not requiring real-time transaction processing to be also
responsible for updating all levels of derived data not only reduces the
size of the transactions, but also has the potential of drastically
reducing synchronization overhead if synchronization methods that take
advantage of this feature are employed.

As a contrasting example, suppose the database application system is
designed in such a way that every event made known to the system would
cause all elements in the database that might be affected by the event
to be updated. This means that, for example, when a funds transfer
transaction is processed in a banking system, in addition to updating
the account balances of the two parties involved in the transfer, the
customer ledger, bank ledger, bank journal, total bank assets, account
balances aggregated by various units, etc., must also be updated before
the transaction commits. This extreme design inevitably results in the
transaction running for a long time, tying up critical data resources
and causing many other transactions to wait or to be aborted. In
reality, an organization can afford, and often does resort to, a great
deal of deferred processing so as to limit the scope of any one
transaction and produce more efficient processing.
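
The contrast between the two designs can be illustrated with a minimal sketch. The class and method names below (`Bank`, `transfer`, `post_ledger`) are hypothetical and are not taken from the system under study; the sketch only assumes that the real-time transaction updates the two account balances and leaves behind a transaction record, while the derived ledger data is brought up to date by a separate, deferred transaction.

```python
from collections import deque


class Bank:
    """Toy model: real-time balances plus a derived ledger (illustrative only)."""

    def __init__(self, balances):
        self.balances = dict(balances)           # data updated in real time
        self.ledger = {a: 0 for a in balances}   # derived data, updated later
        self.pending = deque()                   # transaction records awaiting posting

    def transfer(self, src, dst, amount):
        """Short real-time transaction: two balances and one appended record."""
        if self.balances[src] < amount:
            raise ValueError("insufficient funds")
        self.balances[src] -= amount
        self.balances[dst] += amount
        self.pending.append((src, dst, amount))

    def post_ledger(self):
        """Deferred transaction: reads the records and updates the ledger."""
        posted = 0
        while self.pending:
            src, dst, amount = self.pending.popleft()
            self.ledger[src] -= amount
            self.ledger[dst] += amount
            posted += 1
        return posted


bank = Bank({"A": 100, "B": 50})
bank.transfer("A", "B", 30)   # commits without touching the ledger
bank.post_ledger()            # ledger catches up in a separate step
```

The funds transfer holds only the two balances for the duration of the transaction; the ledger is touched solely by the deferred step, which is the property the hierarchical segmentation is meant to exploit.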

The study begins with an examination of the major transaction and
processing types in the current system. Combining the current behavior
of the transactions and a document concerning general service
requirements and functional primitives <Document-F>, we identify, for
each type of transaction or processing, the data elements or files in
the integrated information resources that would be accessed or updated.
The access patterns of these transactions are then analyzed to identify
a hierarchical segmentation structure that will help in reducing
synchronization overhead.

The left two columns in Figure 45 list the transaction and
processing types identified for the current system along with their data
requirements. Two major observations are made which lead to the
construction of a hierarchical segmentation of the database, to be
presented in a later paragraph:

(1) There are three types of 'risk control information' that are to
be shared by various types of transaction processing:

a.  Aggregate  account  balances  by  region,  country,  conglomerate 
and  industry. 

b.  Customer  risk  control  parameters  such  as  overdraft  limit  and 
credit  rating. 

c.  Customer  demand  deposit  account  balances. 

All three types of information are dynamically changing.
However, the aggregate account balances need to be updated, or
refreshed, only periodically, and can tolerate relatively long
delays, even though the capability of the system to produce the
most up-to-date aggregate account balances (e.g., total bank loan
exposure to Iran) in real time on demand is extremely desirable.
The customer risk control parameters are subject to on-line update
by bank personnel based on customer requests and other
information. The customer demand deposit account balances, on
the other hand, are updated very frequently, triggered by the
high-volume funds transfer transactions. Since all three types
of risk control information are accessed by real-time financial
transactions and subject to on-line update, it is desirable to give
the portion of the database of concern to financial processing a
hierarchical structure that minimizes interference among the
different types of transactions demanding this information. In
particular, the customer demand deposit account (DDA) balances,
which are subject to high-traffic update from funds transfer
processing and high-traffic access from most other types of
financial transaction processing, are designated at a higher level
than the data updated by other FP transaction processing. This
design makes current DDA balances available to many other financial
transaction processing components on a real-time basis without
jeopardizing their concurrent availability to the funds transfer
transactions themselves. A non-cyclic database partition of the
portion of the database accessed by the FP processing components is
shown in Figure 45.

(2) Information processing of the banking system is triggered by
transaction processing. For reasons outlined in the last
subsection, it is desirable that IP be performed on-line
concurrently with processing of FP transactions. However, it is
also clear that updating ledgers and journals, etc., does not have
to be executed as an integral part of the financial transactions
that have triggered it. In other words, a financial transaction can
be committed to the database before the ledger and journal
updates triggered by it are committed. This amounts to designing
processes in IP as database transactions that are separate
from FP but nevertheless need to read the transaction records
produced by the FP transactions. At the end of an accounting
cycle, several other types of IP transactions are also triggered
to produce accounting and MIS statements. These latter processes
amount to database transactions that read from the ledger
and journal data segment to produce other derived information in
the database. A hierarchy of accesses to this information is
therefore exhibited by these processing activities, and the
nature of it in our case results in a non-cyclic database
partition of the portion of the database of concern to the IP
processing, which is shown in Figure 46.

In sum, the study results in a hierarchical segmentation of the
integrated information resources to be used in the integrated banking
system. The data segment graph (DSG) and the data segment hierarchy
(DSH) of this segmentation are shown in Figure 47 and Figure 46 on page
248. Transaction classification corresponding to this segmentation is
summarized in the last column of Figure 45 on page 247.
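
As a rough illustration of how such a segmentation might be checked, the sketch below builds a graph over data segments from a table of transaction types and assigns each segment a level. It rests on simplifying assumptions that should be stated plainly: each transaction type updates exactly one segment and reads zero or more others, plain acyclicity stands in for whatever formal condition the data segment hierarchy requires, and the segment and transaction names are hypothetical rather than taken from Figures 45 to 47.

```python
def build_segment_graph(tx_types):
    """Arc lower -> higher: a transaction that updates `lower` reads `higher`."""
    graph = {}
    for writes, reads in tx_types.values():
        graph.setdefault(writes, set())
        for seg in reads:
            graph.setdefault(seg, set())
            if seg != writes:
                graph[writes].add(seg)
    return graph


def assign_levels(graph):
    """Return {segment: level}, level 0 being the highest; raise if cyclic."""
    level = {}
    in_progress = set()

    def visit(seg):
        if seg in level:
            return level[seg]
        if seg in in_progress:
            raise ValueError(f"cyclic access pattern through {seg!r}")
        in_progress.add(seg)
        # A segment sits one level below the lowest segment its updaters read.
        level[seg] = 1 + max((visit(t) for t in graph[seg]), default=-1)
        in_progress.discard(seg)
        return level[seg]

    for seg in graph:
        visit(seg)
    return level


# Hypothetical transaction types: (segment updated, segments read).
tx_types = {
    "update_risk_params": ("risk_params", set()),
    "funds_transfer": ("dda_balances", {"risk_params"}),
    "loan_processing": ("loan_records", {"dda_balances", "risk_params"}),
    "ledger_posting": ("ledgers", {"loan_records", "dda_balances"}),
}
levels = assign_levels(build_segment_graph(tx_types))
```

In this toy table the high-traffic balances end up above the loan records, so a loan transaction reads balances it never updates. A transaction type that both read a lower segment and updated a higher one would close a cycle and be rejected.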


9.2.4 ADVANTAGES OF THE PROPOSED HIERARCHICAL SEGMENTATION OF THE DATABASE


We have in this section provided a case study that demonstrates the
existence of hierarchical structure in database applications that can
benefit from the HDD approach to concurrency control. We list the
following advantages of the proposed hierarchical segmentation of the
banking database under study:

(1)  Automate  all  manual  inter-system  communication. 

(2) Eliminate duplication of customer information for risk
control and of the historical database for inquiry.

(3) Enable the system to operate in an environment where on-line
processing runs close to 24 hours per day and the batch window
is correspondingly squeezed.

(4) Enable the system to operate with much more timely
information.

(5) Achieve the above by integrating information now residing
in separate systems into a shared data resource with
virtually no increase in data contention and very little
increase in synchronization overhead.


9.3 FUNCTIONAL DECOMPOSITION IN A MULTI-PROCESSOR DATABASE COMPUTER


In this section we apply the HDD approach to concurrency control to
the design of a database computer aimed at large-scale multiprocessing.
The first part of this section introduces the architecture of the
database computer INFOPLEX under study, together with the motivation for
the chosen architecture in comparison with the approaches taken in other
database and transaction oriented multiprocessor systems. The second
part describes how the HDD approach is closely related to the design of
the Functional Hierarchy of INFOPLEX.

9.3.1 INFOPLEX DATABASE COMPUTER ARCHITECTURE

INFOPLEX consists of a Storage Hierarchy providing a very large data
storage system, and a Functional Hierarchy, built on top of the Storage
Hierarchy, responsible for providing database management functions.
Previous work in INFOPLEX has centered around the overall hardware
architecture of INFOPLEX <Madnick79> and the design and evaluation of
the Storage Hierarchy <Lam79>. A preliminary design for the Functional
Hierarchy is reported in <Hsu80, Tc82, Hsu82>. As shown in Figure 48,
the Storage Hierarchy is comprised of levels of storage devices with
various performance and cost features. The high-performance devices,
such as cache memory and main memory, are placed at the top (i.e., the
highest level of the hierarchy), while low-performance devices such as
mass storage systems are placed at the bottom. The Storage Hierarchy
supports the Functional Hierarchy by providing a very large virtual
address space with a small average access time. The actual structure of
the Storage Hierarchy and the movement of data between levels within the
storage system are hidden from the Functional Hierarchy. Requests for
data blocks are made to the highest level of the Storage Hierarchy, and
data is moved automatically between levels of the Storage Hierarchy.

The Functional Hierarchy, on the other hand, is a set of
microprocessors and memory modules that are structured into a number of
processing clusters. Each processing cluster is called a 'level' of
the Functional Hierarchy, and it essentially implements a 'layer' of the
hierarchically decomposed DBMS functionalities. In this section we will
concentrate on the design considerations of the architecture of the
Functional Hierarchy from the concurrency control perspective.

9.3.2  THE  STRUCTURED  CLUSTERING  APPROACH  TO  MULTIPROCESSING 


The conventional approaches to memory organization taken in
transaction- and database-oriented multiprocessing systems can be
classified into three categories. The first is the tightly coupled
approach, exemplified by such systems as the Synapse System. In this
approach, processors share the same address space and communicate with
one another through shared memory. However, to alleviate contention on
the linear communication bus as shown in Figure 49, private memory
modules are attached to processors as cache memory. Processors obtain
data from their own cache memory and need to access the shared
memory only when the data needed is not in the cache. The tightly
coupled approach is simple and quite efficient.
However, it gives rise to the cache consistency problem, which is a
synchronization problem more primitive than the database transaction
synchronization problem. The Synapse system uses hardware locks to deal
with this problem, locking a page and making it unavailable for
loading into a processor's cache if the page is currently in the cache
of another processor. Because the hardware architecture is made
completely transparent to any software other than the operating system,
the database concurrency control problem is not dealt with using any
special technique. The disadvantage of this tightly coupled approach is
that it has to employ two layers of synchronization, one dealing with
the cache consistency issue and one dealing with the logical transaction
consistency issue. This aspect, combined with a lack of structure in the
architecture, makes it potentially unfit for a very high degree of
multiprocessing due to contention among processors for the shared memory
pages and data elements.

The second approach is the segregated system approach, as exemplified
by such systems as the Tandem System and the Stratus System. In this
approach, as shown in Figure 49, every processor is equipped with its
own memory and I/O capability. There is no shared memory among
processing elements. Centralized dispatch is normally the method used
for distributing input jobs among processing elements, and control over
shared data is also centralized in the processing element that runs the
database management system. This approach does away with the cache
consistency problem. However, centralizing database management
activities within one processing element limits the capability of this
architecture to meet the demand typically expected of a very large
database computer. This approach, like the tightly coupled approach, is
oriented more towards a general purpose multiprocessing system than a
database computer.

The third approach is the functional specialization approach,
exemplified by such experimental systems as DBC <Hsiao79>. This approach
emphasizes specialized hardware for certain types of processing within
the database system in order to increase throughput. The drawback of
such an approach lies in the need for special hardware, the cost of
which may prove to be an inhibiting factor. However, the basic
philosophy that a DBMS performs functions that possess structural
properties that can be taken advantage of is a valuable one.

In sum, the first two approaches, while flexible and more suitable
for running application software, do not provide an environment
for accommodating a large number (on the order of hundreds) of
microprocessors. This is due to the hardware limitation of supporting a
large number of processors on a linear bus and, more importantly, the
software limitation of data contention. They do not attempt to exploit
in the multi-processor architecture the structural and functional
modularity that may be inherent in database application systems. The
third approach, on the other hand, tends to tie database logic to
processor technology, which makes it more costly and less adaptable to
changes in either area.

In contrast, INFOPLEX uses the structured clustering approach to
multiprocessing. In this technique, a large number of general purpose
microprocessors are grouped into clusters called 'levels'. As shown in
Figure 50, processors within a level are tightly coupled using the
conventional method of a shared bus (local bus) and shared memory
modules. The intra-level cache consistency problem is avoided by
allowing only read-only pages (e.g., pure code) to be resident in the
private cache attached to each processor. Multiple levels are in turn
linearly coupled via a global bus, but no shared memory exists between
levels. In other words, each level has its own address space and does
not share its address space with any other level; therefore the
inter-level memory consistency problem is avoided. Inter-level
communication strictly takes the form of message passing, which is
performed through hardware (i.e., the gateway controller) without
resorting to inter-level shared memory.

Each level is also supported by the Storage Hierarchy. The Storage
Hierarchy implements a segmented address space, each segment of which is
accessible by, and only by, a particular level. Therefore, the Storage
Hierarchy performs the function of a common mass storage utility for all
levels without violating the rule that no inter-level shared memory is
allowed.

With this architecture, the DBMS functionalities are decomposed into
layers of tasks, each layer to be assigned to a level. Tasks within a
layer are defined through the layer's interfaces to adjacent layers, and
changes within a layer are relatively well contained.

The structured clustering approach offers several advantages. First,
general-purpose microprocessors are adequate as processing elements
within each cluster and therefore specialized hardware is not required.
Second, it provides a framework for supporting a large number of
processors in order to provide the level of parallelism needed to
implement a very large database. Most importantly, this design
specifically does away with the memory inconsistency problem. Third, it
is a structured approach to implementing a large database computer,
possessing the potential of resulting in a more modular and more
reliable design. We will now turn to the database transaction
concurrency control issue in this particular architecture.

9.3.3 CONCURRENCY CONTROL IN AN T5M ARCHITECTURE WITH STRUCTURED CLUSTERING


One aspect of research in the INFOPLEX Functional Hierarchy is a
methodology for functional decomposition. The goal of functional
decomposition is to distribute functionalities of a database and
transaction processing system among the levels of the Functional
Hierarchy so as to minimize inter-level interference. Several
considerations must be taken into account in specifying a methodology.
First, as specified by the architecture, inter-level communication
strictly takes the form of message passing and no shared data is allowed
between levels. This imposes a constraint on the acceptable
decompositions: modules sharing the same data files or tables are better
clustered together into one level. Otherwise, these modules would incur
a heavy overhead of message passing to get jobs done. Other
considerations for decomposition include the frequencies and sizes of
inter-module communication. Modules that communicate frequently and
pass large quantities of data between them are better clustered into
one level than separated.

Finally, there is the need for inter-level communication due to
database concurrency control. Even though the current structured
architecture enables a process executing within a particular level to
access only data elements within that level, a transaction is
nevertheless an entity that spans the boundaries of processes. Control
within a level that ensures mutual exclusion of processes within that
level is not sufficient to guarantee transaction-level
serializability. For example, suppose P11 and P12 are two processes
executing at Levels 1 and 2 on behalf of transaction 1, and P21 and P22
are two processes executing at Levels 1 and 2 on behalf of transaction
2. If P11 is serialized before P21 at Level 1 while P12 is serialized
after P22 at Level 2, then the overall sequence cannot produce an
equivalent serialized execution of these two transactions, and
therefore transaction-level database consistency is violated.

To provide a formal treatment of this problem, we will first give a
formal definition of control and data flow in the Functional Hierarchy
and then proceed to relate the HDD approach to concurrency control to
the functional decomposition of the Functional Hierarchy.

9.3.3.1 FORMAL DEFINITION OF CONTROL AND DATA FLOW IN INFOPLEX

OVERVIEW: We define a generalized multi-level database computer
architecture as follows. A multi-level database computer is composed of
n (n > 1) levels. The top level interfaces with facilities that
communicate with the external world, such as user terminals and
telecommunication channels, and the bottom level implements a storage
subsystem which communicates with all other levels in the database
computer via a virtual storage interface.

Each  level  in  the  system  has  an  Inter-level  Request  Queue  (IRQ)  from 
which  the  level  obtains  requests  to  be  processed.  Inter-level  communi¬ 
cation  takes  the  form  of  inter-level  requests  (IR's);  no  shared  memory 
is  allowed.  A  level  is  physically  implemented  by  a  collection  of 
processors  and  memory  modules  linearly  linked  together.  The  system  is 
initialized  with  empty  IRQ's.  All  processors  at  a  level  are  'idle'. 


Upon arrival of an IR in the IRQ of that level, an idle processor will
examine the nature of the IR and take some appropriate action.
Specifically, if the IR is a forward request, a new process is created
to process this IR, and the processor is dispatched to run the process.
If the IR is a return IR, it is channeled to the existing process to
which it is addressed. Therefore, a request may either be directed to
an existing process at a level, or it may cause a new process to be
generated. We assume that the arrival of a request is the only cause for a
new  process  to  be  generated.  Processing  of  a  request  at  a  level  may 
cause  new  requests  to  be  generated,  which  may  then  be  deposited  at  the 
local  IRQ  or  other  IRQ's  in  the  system.  Processing  of  a  request  may  also 
be  suspended  pending  arrival  of  some  other  requests.  Upon  termination, 
a  process  destroys  itself  and  releases  the  processor.  Barring  system 
failures,  processing  of  a  request  will  always  terminate. 
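
The request-handling cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (`IR`, `Level`, `kind`, `target_pid`) and the labels "new" and "return" are hypothetical and do not come from the INFOPLEX design, and real processors, suspension, and request forwarding between levels are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class IR:
    """An inter-level request. Field names are illustrative assumptions."""
    kind: str                         # "new" spawns a process; "return" goes to an existing one
    txn: str                          # transaction on whose behalf the IR is issued
    payload: str
    target_pid: Optional[int] = None  # only meaningful for "return" IRs


class Level:
    def __init__(self, name: str) -> None:
        self.name = name
        self.irq = deque()            # the Inter-level Request Queue, initially empty
        self.processes = {}           # pid -> payloads delivered to that process
        self._next_pid = 0

    def deposit(self, ir: IR) -> None:
        self.irq.append(ir)

    def step(self) -> Optional[int]:
        """Let one idle processor take the next IR; return the pid that handled it."""
        if not self.irq:
            return None               # nothing to do, processors stay idle
        ir = self.irq.popleft()
        if ir.kind == "new":
            # Arrival of a request is the only cause for creating a process.
            pid = self._next_pid
            self._next_pid += 1
            self.processes[pid] = [ir.payload]
            return pid
        # A return IR is channeled to the existing process it is addressed to.
        self.processes[ir.target_pid].append(ir.payload)
        return ir.target_pid

    def terminate(self, pid: int) -> None:
        """A finished process destroys itself and releases its processor."""
        del self.processes[pid]
```

Every IR carries the identity of its transaction, which is what later allows dependencies to be tracked per transaction rather than per process.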

TRANSACTIONS AND REQUESTS: The system is driven by the arrival of user
transactions. Arrival of a transaction is signified by the deposit of a
request from this transaction into the IRQ of the first level. In
processing this request, new requests may be generated and deposited at
IRQ's. These new requests are, in general, labelled with the identity
of the transaction, and are called requests on behalf of that
transaction. Therefore, all the requests processed in the system are
identified with some transaction. There will be no explicit
inter-transaction communication in the form of request passing.

OBJECTS: Each level i maintains a set of distinct objects that are
to be shared by different transactions and requests. An object is the
smallest stored unit of data at a level. It is through this set of
objects that the effect of a request at a level can be communicated to
another request (in addition to explicit request passing among those
requests on behalf of the same transaction). We shall denote the set of
objects at level i as O(i). Examples of object sets are the virtual
information definition table at the Virtual Information Level and the
security definition at the Security Level.

Object elements in O(i) can be examined, created, destroyed and
modified by processes at level i. (For the purpose of studying
concurrency control, we will not include read-only objects at level i
in the object set O(i).) Each object has some name(s) associated with
it, and an operation will always name the object to be operated on. We
will also denote the union of all object sets O(i) in the DBM as O.

9.3.3.2 FORMAL CONDITIONS FOR CONSISTENCY

The object is the focal point of database concurrency control. The
premise of the correctness of a concurrency control mechanism is that,
if no concurrent processing of multiple transactions exists, that is, if
a transaction is not started until the previous one is finished, then
the database will be left in a consistent state at the end of processing
of a transaction.


The  notion  of  direct  dependencies  among  transactions  and  that  of  a 
transaction  dependency  graph,  developed  in  Chapter  two,  are  directly 
applicable  to  each  level  of  the  multi-level  DBM  defined  here.  We 
rephrase  these  notions  in  our  context  as  follows: 

Definition. A transaction dependency graph of a schedule S at level
i of a multi-level DBM is a digraph, denoted as $TG^{(i)}(S)$, where the
nodes are the transactions in S and the arcs, representing direct
dependencies between transactions, exist according to the following
rules:

$t_2 \rightarrow t_1$ iff for some object element $d \in O(i)$,

(1) $w_1(d^j)$ and $r_2(d^j)$ are in S for some $d^j$, or

(2) $r_1(d^j)$ and $w_2(d^k)$ are in S for some $d^j, d^k$ where $d^j$ is the
predecessor of $d^k$, or

(3) $w_1(d^j)$, $w_2(d^k)$ and $r_2(d^k)$ are in S and $d^j$ is a predecessor
of $d^k$.

Since, by our definition of the multi-level DBM, the object sets O(i)
and O(j) for levels i and j, where $i \neq j$, do not intersect, one
can express the condition for overall serializability in terms of the
acyclicity condition on the union of the transaction dependency graphs
$TG^{(i)}(S)$, $i = 1, \ldots, n$. This is formally described in the
following proposition:


Proposition 1

Consistency of a multi-level DBM is maintained if the global
transaction dependency graph, defined as $\bigcup_{i=1}^{n} TG^{(i)}(S)$,
is acyclic.

To ensure the above condition, each level must first ensure that
$TG^{(i)}(S)$ is acyclic, and a global mechanism must exist which ensures
that if $t_1 \rightarrow t_2$ in $TG^{(i)}(S)$ for some i, then there
exists no path $t_2 \rightarrow \ldots \rightarrow t_1$ in another
$TG^{(j)}(S)$ for any $j \neq i$.
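
As a minimal sketch of the condition in Proposition 1, the following Python fragment takes the arcs of each per-level graph as given, forms their union, and tests it for acyclicity. The function names and the dictionary layout are assumptions made for illustration. The cycle test is a standard topological sort (Kahn's algorithm), swapped in here as a convenient checker and not something the thesis prescribes.

```python
def union_graph(level_graphs):
    """Union of the arc sets of the per-level dependency graphs."""
    edges = set()
    for arcs in level_graphs.values():
        edges |= set(arcs)
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: a digraph is acyclic iff every node can be peeled off."""
    nodes = {n for arc in edges for n in arc}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return removed == len(nodes)


# The two-transaction example given earlier: Level 1 orders the transactions
# one way and Level 2 the other way.
level_graphs = {
    1: [("t1", "t2")],
    2: [("t2", "t1")],
}
each_level_ok = all(is_acyclic(set(arcs)) for arcs in level_graphs.values())
globally_ok = is_acyclic(union_graph(level_graphs))
```

In this example each level's graph is acyclic on its own, yet the union contains a cycle, which is why local control within each level is not sufficient.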

9.3.3.3 APPLYING THE HDD APPROACH TO CONCURRENCY CONTROL IN INFOPLEX

What has been stated is the condition for database transaction
consistency to hold in a multi-level database computer as defined above.
The basic notion is not very different from that of a non-duplicated
distributed database.

It can be shown that either the two-phase locking algorithm or the
conventional timestamping algorithm, implemented at each level, will
enforce global consistency. This follows directly from the fact that
both algorithms impose a partial order on transaction dependencies and
the multi-level DBM architecture does not alter this imposition.
However, the multi-level multi-processor DBM environment does make
certain aspects of the overhead involved in these conventional
algorithms more costly. We will elaborate on this point and show how the
HDD approach to concurrency control can be exploited to improve this
situation.


(1) Inter-level overhead: If the two-phase locking algorithm is used,
the mechanisms used to prevent illegal interleaving are blocking and
deadlock handling. Blocking takes place when a request being processed
at level i on behalf of transaction $t_1$ attempts to access an object in
O(i) that is currently locked in an incompatible mode by another
transaction $t_2$. (Note that the request on behalf of $t_2$ that locked
this object may or may not be in existence any more at this point;
however, $t_2$ must still be active somewhere at this point.) Therefore
blocking is an activity performed strictly within a level and does not
require inter-level communication. On the other hand, a deadlock may
occur across levels. For example, a deadlock may occur if $t_1$ is
blocked by $t_2$ at level i, $t_2$ is blocked by $t_3$ at level j, and
$t_3$ is blocked by $t_1$ at level k. Distributed deadlocks must be
detected and resolved using distributed deadlock detection algorithms
(e.g., <Obermarck82>), which require a considerable amount of
inter-level coordination.
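
To make the three-level deadlock example concrete, here is a small sketch that gathers the "blocked by" facts from every level and follows them until a cycle closes. It is a centralized check written for illustration only, not the distributed algorithm of <Obermarck82>; the function name, the data layout, and the simplifying assumption that a blocked transaction waits on exactly one other transaction are all hypothetical.

```python
def find_deadlock(waits_by_level):
    """Return the chain of (waiter, holder, level) triples forming a cycle, or None.

    waits_by_level maps a level name to (waiter, holder) pairs observed there.
    Assumes each blocked transaction waits on exactly one other transaction.
    """
    blocked_by = {}
    for level, pairs in waits_by_level.items():
        for waiter, holder in pairs:
            blocked_by[waiter] = (holder, level)

    for start in blocked_by:
        path, seen, current = [], set(), start
        # Follow the chain of waits until it ends or revisits a transaction.
        while current in blocked_by and current not in seen:
            seen.add(current)
            holder, level = blocked_by[current]
            path.append((current, holder, level))
            current = holder
        if current == start and path:
            return path            # the chain came back to where it began
    return None


# The example from the text: no single level sees a cycle by itself.
waits = {
    "i": [("t1", "t2")],
    "j": [("t2", "t3")],
    "k": [("t3", "t1")],
}
cycle = find_deadlock(waits)
```

The point of the sketch is the input: the cycle becomes visible only after wait information from levels i, j and k has been brought together, and that gathering is exactly the inter-level coordination the paragraph above refers to.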

The timestamping algorithm does away with the distributed deadlock
problem. However, if a transaction writes into the object sets of
several levels, all these levels must be involved in performing atomic
validation and commit of the transaction. This means that every object
element that the transaction will write must be visited and, if the
write request is validated, a temporary lock must be held on the object
until the transaction finishes writing and timestamping all elements and
commits. If this operation involves more than one level, it cannot be
expected to execute as fast as it would if only one processor were
involved in performing it in a single critical section. Therefore the
multi-level environment degrades the efficiency of the algorithm, and
much inter-level communication for the purpose of concurrency control
may still be incurred.

To  alleviate  this  problem,  a  guideline  should  be  used  that  encourages 
the  design  of  the  Functional  Hierarchy  to  cluster  objects  that  are  to  be 
updated  in  a  single  transaction  into  object  set(s)  of  one  or  only  a  few 
levels.  In  other  words,  this  'update  cluster'  property  is  one  of  the 
aspects  that  the  functional  decomposition  methodology  should  achieve  in 
order  to  reduce  inter-level  communication.  Another  motivation  for 
accommodating  the  update  cluster  property  in  functional  decomposition  is 
that  such  decomposition  can  also  benefit  from  the  Hierarchical 
Timestamping  Algorithm's  capability  of  reducing  intra-level 
inter-processor  communication  overhead.  This  latter  point  is  elaborated 
in  the  following  paragraphs. 

(2) Intra-level overhead: As discussed in the beginning of this
thesis, setting locks or leaving timestamps requires the processor to
first obtain the exclusive right to enter a critical section. This is
required because lock tables or timestamp tables are themselves shared
system resources. Alternatively, a processor can send a message to a
dedicated process (or processor) which manages the tables, and wait for
responses from the dedicated process. Either method requires
inter-processor coordination within a level of the DBM to ensure that
the atomicity of the activity of setting locks or timestamping is
enforced. This form of inter-processor data sharing is one of the
factors that limit the effective number of processors a level of the DBM
can have. (For a discussion of this concept of software lockout in a
general multi-processor system, see <Madnick68>.)

Specifically, the timestamping algorithm involves the need to
timestamp every data element accessed by a process executing on behalf
of a particular transaction. A processor attempting to obtain the right
to enter a critical section in order to read-timestamp a data element
would be forced to wait if the critical section is at the same time
occupied by another processor attempting to perform timestamping.
An algorithm that helps reduce this form of inter-processor
interference would have the effect of increasing the effective level of
multiprocessing within a level.

Therefore, a functional decomposition methodology and an accompanying
transaction design that encourage the write sets of transactions in the
system to be concentrated in one or a few levels would greatly reduce
the inter-level synchronization overhead. In addition, if the object
sets of all the levels in the system can be given a hierarchical order
such that most of the transactions in the system read from object
sets that are of higher hierarchical order than the object sets in which
their write sets reside, then the Hierarchical Timestamping Algorithm
can be profitably used to reduce the need to read-timestamp data access
requests for certain sets of transactions, and therefore reduce the
intra-level inter-processor interference.

An example illustrating the application of the HDD approach to
concurrency control in the INFOPLEX database computer architecture is
shown in Figure 51. In this example, the access method portion of the
DBM is decomposed into three levels: the high-level index processing
level (L1), the leaf-level index processing level (L2) and the data
record access level (L3). Three classes of transactions are recognized
by this portion of the system: key-updaters, non-key-updaters and
index-balancing requests. L1 owns the object set of all high-level index
pages (D1), L2 the object set of all leaf-level index pages (D2), and L3
the object set of all data records (D3). A key-updater reads from D1 and
writes to D2 and D3. A non-key-updater reads from D1 and D2 and writes
to D3. Finally, an index-balancing request operates only in D1.
With this design, the key-updaters' read accesses to D1 would not incur
inter-processor overhead in L1. Similarly, the non-key-updaters' read
accesses to D1 and D2 would not incur inter-processor overhead in L1 and
L2.
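
The reasoning in this example can be reproduced with a short sketch. It assumes, as a simplification, that a read needs no read timestamp exactly when the segment being read is strictly higher in the hierarchy than every segment the transaction class writes. The table of classes is taken from the example above; the names `SEGMENT_RANK`, `CLASSES` and `untracked_reads` are hypothetical.

```python
# Smaller rank means higher in the data segment hierarchy (D1 is the top).
SEGMENT_RANK = {"D1": 1, "D2": 2, "D3": 3}

# Read and write sets of the three transaction classes in the example.
CLASSES = {
    "key-updater":     {"reads": {"D1"},       "writes": {"D2", "D3"}},
    "non-key-updater": {"reads": {"D1", "D2"}, "writes": {"D3"}},
    "index-balancing": {"reads": {"D1"},       "writes": {"D1"}},
}


def untracked_reads(cls):
    """Segments this class can read without leaving a read timestamp.

    Assumed rule: the segment must rank strictly above every written segment.
    """
    highest_written = min(SEGMENT_RANK[s] for s in cls["writes"])
    return {s for s in cls["reads"] if SEGMENT_RANK[s] < highest_written}


result = {name: untracked_reads(cls) for name, cls in CLASSES.items()}
```

Under this assumed rule the sketch yields D1 for key-updaters, D1 and D2 for non-key-updaters, and nothing for index-balancing requests, which matches the levels the text names as being spared inter-processor overhead.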


In summary, this section has provided a formal treatment of the
concurrency control problem in the Functional Hierarchy of the database
computer INFOPLEX. It defines the formal conditions for database
transaction-level consistency, and shows how this should be taken into
consideration as one of the aspects that produce inter-level
interference. It then shows how the Hierarchical Timestamping Algorithm,
by taking advantage of the need to structure the DBMS functionalities to
reduce inter-level interference, can produce the effect of reducing
intra-level inter-processor interference. In general, the HDD
perspective of concurrency control is expected to introduce the
dimension of data decomposition into the formulation of a functional
decomposition methodology.


SUMMARY  AND  FUTURE  RESEARCH  DIRECTIONS 


10.1 SUMMARY


Database  management  systems  are  emerging  as  a  crucial  element  of 
today's  business  operations.  To  fully  exploit  inherent  parallelism  in  a 
computer  system  and  to  improve  transaction  response  time,  a  DBMS  must 
support  multiple  users  at  the  same  time,  allowing  multiple  transactions 
to  run  in  parallel.  However,  interleaving  of  the  execution  of  multiple 
transactions  may  result  in  violation  of  database  integrity.  To  prevent 
the  latter  from  happening,  a  concurrency  control  facility  must  be 
included  in  a  DBMS. 

There is a growing concern over the efficiency of the concurrency
control facility and its impact on the performance of a DBMS.
Conventional approaches to concurrency control, taking serializability
as the criterion of correctness, require that every read and write
access request from a transaction be controlled by leaving a 'trace'
(e.g., a lock or a timestamp) in the system. Setting locks or
timestamping is an expensive operation which not only incurs
operational overhead but also produces some inter-transaction conflicts
that could otherwise be avoided. The purpose of the current research is
to seek ways to reduce the overhead of synchronizing certain types of
read accesses while achieving the goal of serializability.

To this end, a new approach to concurrency control for database
management systems has been proposed. The technique makes use of a
hierarchical database decomposition, a procedure which decomposes the
entire database into a hierarchy of data segments based on the access
pattern of the update transactions to be run in the system. A
corresponding classification of the update transactions is derived
where each transaction class is 'rooted' in one of the data segments.
The heart of the new approach lies in differentiating the data accesses
of an update transaction into three types: those accesses to the
transaction's own root data segment, those to data segments higher than
the transaction's own root data segment, and those to data segments
lower than the root. The algorithm consists of three protocols:
Protocol E for accessing data elements within the root segment;
Protocol H for accessing higher data segments; and Protocol L for
accessing lower data segments. When the data segment hierarchy consists
of only one data segment, the hierarchical timestamp algorithm reduces
to the conventional multi-version timestamp algorithm.
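
The three-way split of accesses can be written as a small dispatch function. This is a sketch under a simplifying assumption: the data segments are given a total order through a hypothetical `rank` table (smaller number means higher in the hierarchy), whereas the decomposition in the thesis is a hierarchy in which some segments may not be comparable. Only the choice of protocol is shown, not the protocols themselves.

```python
def protocol_for(root, target, rank):
    """Pick the protocol for a transaction rooted in `root` accessing `target`.

    `rank` maps segment name -> position in the hierarchy (smaller is higher).
    A totally ordered hierarchy is assumed for this sketch.
    """
    if target == root:
        return "E"   # access within the transaction's own root segment
    if rank[target] < rank[root]:
        return "H"   # access to a higher segment (the 'cheap' protocol)
    return "L"       # access to a lower segment


rank = {"S1": 1, "S2": 2, "S3": 3}
choices = [protocol_for("S2", seg, rank) for seg in ("S1", "S2", "S3")]

# With a single segment every access falls under Protocol E, the case in
# which the algorithm is said to reduce to conventional multi-version
# timestamping.
single = protocol_for("S1", "S1", {"S1": 1})
```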

The potential benefit of this algorithm stems from the use of
Protocol H (the 'cheap' protocol). When Protocol H is applicable, the
algorithm enables read accesses to higher data segments to proceed
without ever having to wait or to leave any trace of these accesses,
thereby reducing the overhead of concurrency control. A protocol
(Protocol R) for handling ad-hoc read-only transactions in this
environment is also devised, which does not require read-only
transactions to wait or set any read timestamp. The proof of
correctness of these algorithms in terms of their preservation of
serializability is provided through a set of properties, lemmas and
theorems which center around a new ordering concept called
'topologically follows.' These results are reported in Chapter three to
Chapter six.

An implementation scheme for the proposed Hierarchical Timestamping
Algorithm is described in Chapter seven. It addresses both the problems
of multi-version database maintenance and timestamp management and
attempts to achieve the goal of maximum parallelism within the
concurrency control facility. In Chapter eight, the hierarchical
database decomposition problem, i.e., the problem of how to find the
optimal database partition and the data segment hierarchy that maximize
the gain of using Protocol H, is formulated, and its complexity
analyzed. The problem is shown to be NP-hard and a heuristic procedure
is also proposed. Finally, in Chapter nine, the HDD approach is applied
to three different application areas in which its effect on relieving
database contention and on structuring databases and transactions so as
to reduce concurrency control overhead is demonstrated.


10.2 FUTURE RESEARCH DIRECTIONS


The thrust of the research is more than proposing a new concurrency
control algorithm; it demonstrates the potential benefit of exploiting
the knowledge and the structure of the application systems to implement
more efficient and more tailored concurrency control mechanisms.
Combining this realization and other aspects of this research, we
propose the following future research directions.

(1) Additional studies providing further evidence of the existence
of hierarchical structural disciplines and the applicability of the HDD
approach: While Chapter nine has provided an exploratory study of the
application of the HDD approach, further studies are needed in the form
of case studies and performance modelling tools. In addition, specific
database and transaction design guidelines that enable the application
base to be broadened need to be formalized.

(2) Exploration of other algorithms that take advantage of the
hierarchical structural discipline in a transaction processing system:
The current algorithm is based on the idea that a concurrency control
algorithm can assign an array of timestamps, rather than a single
timestamp, to a transaction. This approach increases the flexibility of
a timestamp-based algorithm. However, research into alternative
timestamp-based algorithms which make use of a single timestamp and yet
still produce similar flexibility may prove fruitful. If this proves to
be attainable, then different algorithms achieving the same goal of
reducing synchronization overhead but with different performance
features may be made available to system designers. In the same vein,
theoretical and practical investigations into whether similar
flexibility can be built into lock-based algorithms will also be
interesting undertakings.

(3) Exploration of other forms of structural disciplines that may
render efficient concurrency control algorithms: Extending the original
idea presented in the research of SDD-1, transaction analysis can
continue to be used as a tool for discovering other forms of structural
disciplines inherent in application studies. Moreover, current
research has only taken on the method of syntactic analysis of
transactions to devise alternative algorithms. Semantic explorations, on
the other hand, may yield even more powerful algorithms. This approach
would be especially useful in systems where transaction types are very
well formulated and understood, and large volumes of transactions of a
limited number of transaction types are repetitively executed.

(4) Multiprocessor database computer-specific studies: In order to
fully utilize existing and emerging processor and memory technologies,
the problem of how to intelligently organize components of a large-scale
multiprocessor database computer so as to exploit the maximum level of
concurrency expected in an application system presents a challenge all
by itself. The current research has only proposed one approach to
dealing with the problem, focusing on the database concurrency control
issue. The general philosophy advocated in this research is that there
must be special structural disciplines, for reasons related or unrelated
to increasing parallelism, that are to be incorporated into the
multiprocessor database computer, and these disciplines present
opportunities for one to design algorithms that are more intelligent
than conventional ones. Of specific interest in this research direction
are performance modeling of the impact of concurrency control in a
multiprocessor system, and exploration of other methods that reduce
inter-processor and inter-cluster communication